The file ‘WeatherEDA_202005_v002.Rmd’ contains exploratory data analysis for historical weather data as contained in METAR archives hosted by Iowa State University.
Data have been downloaded, processed, cleaned, and integrated for several stations (airports) and years, with .rds files saved in “./RInputFiles/ProcessedMETAR”.
This module will perform initial modeling on the processed weather files. It builds on the previous ‘WeatherModeling_202006_v001.Rmd’ and leverages functions in ‘WeatherModelingFunctions_v001.R’.
There are three main processed files available for further exploration:
metar_postEDA_20200617.rds and metar_postEDA_extra_20200627.rds
metar_modifiedClouds_20200617.rds and metar_modifiedclouds_extra_20200627.rds
metar_precipLists_20200617.rds and metar_precipLists_extra_20200627.rds
Several mapping files are defined for use in plotting; tidyverse, lubridate, and caret are loaded; and the relevant functions are sourced:
# The process frequently uses tidyverse, lubridate, caret, and randomForest
library(tidyverse)
## -- Attaching packages ------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.2.1 v purrr 0.3.3
## v tibble 2.1.3 v dplyr 0.8.4
## v tidyr 1.0.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## -- Conflicts ---------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
# The main path for the files
filePath <- "./RInputFiles/ProcessedMETAR/"
# Sourcing functions
source("./WeatherModelingFunctions_v001.R")
# Descriptive names for key variables
varMapper <- c(source="Original source file name",
locale="Descriptive name",
dtime="Date-Time (UTC)",
origMETAR="Original METAR",
year="Year",
monthint="Month",
month="Month",
day="Day of Month",
WindDir="Wind Direction (degrees)",
WindSpeed="Wind Speed (kts)",
WindGust="Wind Gust (kts)",
predomDir="General Prevailing Wind Direction",
Visibility="Visibility (SM)",
Altimeter="Altimeter (inches Hg)",
TempF="Temperature (F)",
DewF="Dew Point (F)",
modSLP="Sea-Level Pressure (hPa)",
cType1="First Cloud Layer Type",
cLevel1="First Cloud Layer Height (ft)",
isRain="Rain at Observation Time",
isSnow="Snow at Observation Time",
isThunder="Thunder at Observation Time",
tempFHi="24-hour High Temperature (F)",
tempFLo="24-hour Low Temperature (F)",
minHeight="Minimum Cloud Height (ft)",
minType="Obscuration Level at Minimum Cloud Height",
ceilingHeight="Minimum Ceiling Height (ft)",
ceilingType="Obscuration Level at Minimum Ceiling Height",
hr="Hour of Day (Zulu time)",
hrfct="Hour of Day (Zulu time)",
locNamefct="Locale Name"
)
# File name to city name mapper
cityNameMapper <- c(katl_2016="Atlanta, GA (2016)",
kbos_2016="Boston, MA (2016)",
kdca_2016="Washington, DC (2016)",
kden_2016="Denver, CO (2016)",
kdfw_2016="Dallas, TX (2016)",
kdtw_2016="Detroit, MI (2016)",
kewr_2016="Newark, NJ (2016)",
kgrb_2016="Green Bay, WI (2016)",
kgrr_2016="Grand Rapids, MI (2016)",
kiah_2016="Houston, TX (2016)",
kind_2016="Indianapolis, IN (2016)",
klas_2014="Las Vegas, NV (2014)",
klas_2015="Las Vegas, NV (2015)",
klas_2016="Las Vegas, NV (2016)",
klas_2017="Las Vegas, NV (2017)",
klas_2018="Las Vegas, NV (2018)",
klas_2019="Las Vegas, NV (2019)",
klax_2016="Los Angeles, CA (2016)",
klnk_2016="Lincoln, NE (2016)",
kmia_2016="Miami, FL (2016)",
kmke_2016="Milwaukee, WI (2016)",
kmsn_2016="Madison, WI (2016)",
kmsp_2016="Minneapolis, MN (2016)",
kmsy_2014="New Orleans, LA (2014)",
kmsy_2015="New Orleans, LA (2015)",
kmsy_2016="New Orleans, LA (2016)",
kmsy_2017="New Orleans, LA (2017)",
kmsy_2018="New Orleans, LA (2018)",
kmsy_2019="New Orleans, LA (2019)",
kord_2014="Chicago, IL (2014)",
kord_2015="Chicago, IL (2015)",
kord_2016="Chicago, IL (2016)",
kord_2017="Chicago, IL (2017)",
kord_2018="Chicago, IL (2018)",
kord_2019="Chicago, IL (2019)",
kphl_2016="Philadelphia, PA (2016)",
kphx_2016="Phoenix, AZ (2016)",
ksan_2014="San Diego, CA (2014)",
ksan_2015="San Diego, CA (2015)",
ksan_2016="San Diego, CA (2016)",
ksan_2017="San Diego, CA (2017)",
ksan_2018="San Diego, CA (2018)",
ksan_2019="San Diego, CA (2019)",
ksat_2016="San Antonio, TX (2016)",
ksea_2016="Seattle, WA (2016)",
ksfo_2016="San Francisco, CA (2016)",
ksjc_2016="San Jose, CA (2016)",
kstl_2016="Saint Louis, MO (2016)",
ktpa_2016="Tampa Bay, FL (2016)",
ktvc_2016="Traverse City, MI (2016)"
)
# File names in 2016, based on cityNameMapper
names_2016 <- grep(names(cityNameMapper), pattern="[a-z]{3}_2016", value=TRUE)
The main data will be from the metar_postEDA files. They are integrated below, and cloud and ceiling heights are converted to factors:
# Main weather data
metarData <- readRDS(paste0(filePath, "metar_postEDA_20200617.rds")) %>%
bind_rows(readRDS(paste0(filePath, "metar_postEDA_extra_20200627.rds"))) %>%
mutate(orig_minHeight=minHeight,
orig_ceilingHeight=ceilingHeight,
minHeight=mapCloudHeight(minHeight),
ceilingHeight=mapCloudHeight(ceilingHeight)
)
glimpse(metarData)
## Observations: 437,417
## Variables: 43
## $ source <chr> "kdtw_2016", "kdtw_2016", "kdtw_2016", "kdtw_201...
## $ locale <chr> "Detroit, MI (2016)", "Detroit, MI (2016)", "Det...
## $ dtime <dttm> 2016-01-01 00:53:00, 2016-01-01 01:53:00, 2016-...
## $ origMETAR <chr> "KDTW 010053Z 23012KT 10SM OVC020 00/M05 A3021 R...
## $ year <dbl> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, ...
## $ monthint <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ month <fct> Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan...
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ WindDir <chr> "230", "230", "230", "240", "230", "220", "220",...
## $ WindSpeed <int> 12, 12, 11, 14, 16, 13, 14, 16, 13, 16, 17, 13, ...
## $ WindGust <dbl> NA, NA, NA, 23, 22, NA, 20, 20, NA, 22, NA, NA, ...
## $ predomDir <fct> SW, SW, SW, SW, SW, SW, SW, SW, SW, SW, SW, SW, ...
## $ Visibility <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 8, 5, 7, 8, 10, ...
## $ Altimeter <dbl> 30.21, 30.21, 30.19, 30.19, 30.18, 30.16, 30.14,...
## $ TempF <dbl> 32.00, 32.00, 32.00, 30.92, 30.92, 32.00, 30.92,...
## $ DewF <dbl> 23.00, 21.92, 21.02, 19.94, 19.94, 19.94, 19.94,...
## $ modSLP <dbl> 1023.6, 1023.5, 1023.0, 1023.0, 1022.7, 1022.0, ...
## $ cType1 <chr> "OVC", "OVC", "OVC", "OVC", "OVC", "OVC", "OVC",...
## $ cType2 <chr> "", "", "", "", "", "", "", "", "", "OVC", "OVC"...
## $ cType3 <chr> "", "", "", "", "", "", "", "", "", "", "", "", ...
## $ cType4 <chr> "", "", "", "", "", "", "", "", "", "", "", "", ...
## $ cType5 <chr> "", "", "", "", "", "", "", "", "", "", "", "", ...
## $ cType6 <chr> "", "", "", "", "", "", "", "", "", "", "", "", ...
## $ cLevel1 <dbl> 2000, 2000, 2200, 2500, 2700, 2500, 2500, 2500, ...
## $ cLevel2 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 4500, 4000, ...
## $ cLevel3 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ cLevel4 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ cLevel5 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ cLevel6 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ isRain <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ isSnow <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ isThunder <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ p1Inches <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 0, 0, 0, 0, 0, N...
## $ p36Inches <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 0, NA, NA, 0, NA...
## $ p24Inches <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ tempFHi <dbl> NA, NA, NA, NA, 36, NA, NA, NA, NA, NA, NA, NA, ...
## $ tempFLo <dbl> NA, NA, NA, NA, 31, NA, NA, NA, NA, NA, NA, NA, ...
## $ minHeight <fct> Low, Low, Low, Low, Low, Low, Low, Low, Low, Low...
## $ minType <fct> OVC, OVC, OVC, OVC, OVC, OVC, OVC, OVC, OVC, BKN...
## $ ceilingHeight <fct> Low, Low, Low, Low, Low, Low, Low, Low, Low, Low...
## $ ceilingType <fct> OVC, OVC, OVC, OVC, OVC, OVC, OVC, OVC, OVC, BKN...
## $ orig_minHeight <dbl> 2000, 2000, 2200, 2500, 2700, 2500, 2500, 2500, ...
## $ orig_ceilingHeight <dbl> 2000, 2000, 2200, 2500, 2700, 2500, 2500, 2500, ...
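The function mapCloudHeight() is sourced from ‘WeatherModelingFunctions_v001.R’ and is not shown above. A minimal sketch of what such a bucketing function could look like, with purely illustrative cutoffs (the actual breakpoints and factor levels may differ):

```r
# Hypothetical sketch of mapCloudHeight(): bucket numeric cloud heights (ft)
# into a factor. Cutoffs below are illustrative, not the sourced values.
mapCloudHeight <- function(height) {
  cut(height,
      breaks=c(-Inf, 3000, 10000, Inf),
      labels=c("Low", "Mid", "High")
  )
}
mapCloudHeight(c(2000, 5000, 25000, NA))
```

Missing heights pass through as NA, which keeps downstream filtering straightforward.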
An initial exploration of the data can use hierarchical clustering based on several numeric features (temperature, dew point, sea-level pressure, wind speed), aggregated by month. Example code:
# Create hierarchical clustering for 2016 data
# Distance is calculated only where month is the same
distData <- metarData %>%
filter(year==2016) %>%
select(locale, month, TempF, DewF, modSLP, WindSpeed) %>%
group_by(locale, month) %>%
summarize_if(is.numeric, mean, na.rm=TRUE) %>%
ungroup() %>%
mutate(rowname=paste0(locale, "_", month)) %>%
column_to_rownames() %>%
select(-locale, -month) %>%
as.matrix() %>%
scale() %>%
dist() %>%
as.matrix() %>%
as.data.frame() %>%
rownames_to_column(var="locale1") %>%
pivot_longer(-locale1, names_to="locale2", values_to="dist") %>%
mutate(city1=str_replace(locale1, "_\\w{3}", ""),
city2=str_replace(locale2, "_\\w{3}", ""),
month1=str_sub(locale1, -3),
month2=str_sub(locale2, -3)) %>%
filter(month1==month2) %>%
group_by(city1, city2) %>%
summarize(meandist=mean(dist), sddist=sd(dist)) %>%
ungroup() %>%
select(-sddist) %>%
pivot_wider(city1, names_from="city2", values_from="meandist") %>%
column_to_rownames(var="city1") %>%
as.matrix() %>%
as.dist(diag=FALSE)
distData %>%
hclust(method="complete") %>%
plot()
distData %>%
hclust(method="single") %>%
plot()
At a glance, the clustering yields several sensible findings. A handful of cities do not seem to cluster with anything else.
Suppose that the only information available about two cities were their temperatures and dew points, with month also included. How well would a basic random forest, with mtry=3, classify the cities?
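The classification is handled by rfTwoLocales(), sourced from ‘WeatherModelingFunctions_v001.R’ and not shown here. A hypothetical sketch of such a function (the argument names match the call that follows; the body is illustrative only):

```r
# Hypothetical sketch of rfTwoLocales() -- illustrative only; the sourced
# version may differ. Subset to the two locales, keep complete cases, and fit
# a two-class random forest on the chosen variables.
prepTwoLocales <- function(data, loc1, loc2, vrbls) {
  df <- data[data$source %in% c(loc1, loc2), c("source", vrbls)]
  df <- df[stats::complete.cases(df), ]
  df$source <- factor(df$source)   # two-level outcome
  df
}
rfTwoLocales <- function(data, loc1, loc2, vrbls, ntree=25, mtry=3) {
  randomForest::randomForest(source ~ .,
                             data=prepTwoLocales(data, loc1, loc2, vrbls),
                             ntree=ntree, mtry=mtry
  )
}
```

With only three predictors, mtry=3 means every split considers all of them.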
The function is then run for every combination of the 2016 locales in cityNameMapper. A single random seed is set before the full batch of runs:
# Create a container list to hold the output
list_2016_TempF_DewF_month <- vector("list", 0.5*length(names_2016)*(length(names_2016)-1))
set.seed(2006281443)
n <- 1
for (ctr in 1:(length(names_2016)-1)) {
for (ctr2 in (ctr+1):length(names_2016)) {
list_2016_TempF_DewF_month[[n]] <- rfTwoLocales(metarData,
loc1=names_2016[ctr],
loc2=names_2016[ctr2],
vrbls=c("TempF", "DewF", "month"),
ntree=25
)
n <- n + 1
if ((n %% 50) == 0) { cat("Through number:", n, "\n")}
}
}
## Through number: 50
## Through number: 100
## Through number: 150
## Through number: 200
## Through number: 250
## Through number: 300
## Through number: 350
## Through number: 400
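helperAccuracyLocale() is likewise sourced. A hypothetical sketch that reads overall and per-locale out-of-bag accuracy off a two-class randomForest confusion matrix (the sourced version returns a tibble with these columns; here a plain data frame is used):

```r
# Hypothetical sketch of helperAccuracyLocale(): pull overall and per-locale
# out-of-bag accuracy from a fitted two-class randomForest. Illustrative only.
helperAccuracyLocale <- function(rf) {
  cm <- rf$confusion[, 1:2, drop=FALSE]   # drop the class.error column
  data.frame(locale1=rownames(cm)[1],
             locale2=rownames(cm)[2],
             accOverall=sum(diag(cm)) / sum(cm),
             accLocale1=cm[1, 1] / sum(cm[1, ]),
             accLocale2=cm[2, 2] / sum(cm[2, ]),
             stringsAsFactors=FALSE
  )
}
```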
Accuracy can then be assessed:
# Create a tibble from the underlying accuracy data
acc_TempF_DewF_month <- map_dfr(list_2016_TempF_DewF_month, .f=helperAccuracyLocale)
# Assess the top 10 classification accuracies
acc_TempF_DewF_month %>%
arrange(-accOverall) %>%
head(10)
## # A tibble: 10 x 5
## locale1 locale2 accOverall accLocale1 accLocale2
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Denver, CO (2016) Miami, FL (2016) 0.997 0.997 0.997
## 2 Las Vegas, NV (2016) Miami, FL (2016) 0.990 0.991 0.989
## 3 Miami, FL (2016) Seattle, WA (2016) 0.985 0.985 0.985
## 4 Denver, CO (2016) Tampa Bay, FL (2016) 0.984 0.986 0.982
## 5 Miami, FL (2016) Traverse City, MI (201~ 0.980 0.981 0.980
## 6 Miami, FL (2016) Minneapolis, MN (2016) 0.978 0.978 0.978
## 7 Boston, MA (2016) Miami, FL (2016) 0.976 0.975 0.976
## 8 Denver, CO (2016) New Orleans, LA (2016) 0.975 0.975 0.975
## 9 Miami, FL (2016) Phoenix, AZ (2016) 0.974 0.973 0.975
## 10 Green Bay, WI (2016) Miami, FL (2016) 0.971 0.972 0.971
# Assess the bottom 10 classification accuracies
acc_TempF_DewF_month %>%
arrange(accOverall) %>%
head(10)
## # A tibble: 10 x 5
## locale1 locale2 accOverall accLocale1 accLocale2
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Newark, NJ (2016) Philadelphia, PA (20~ 0.580 0.591 0.570
## 2 Chicago, IL (2016) Milwaukee, WI (2016) 0.591 0.600 0.583
## 3 Chicago, IL (2016) Madison, WI (2016) 0.598 0.622 0.573
## 4 Detroit, MI (2016) Grand Rapids, MI (20~ 0.608 0.568 0.649
## 5 Chicago, IL (2016) Detroit, MI (2016) 0.610 0.633 0.587
## 6 Grand Rapids, MI (201~ Traverse City, MI (2~ 0.613 0.630 0.597
## 7 Madison, WI (2016) Milwaukee, WI (2016) 0.618 0.625 0.611
## 8 Green Bay, WI (2016) Madison, WI (2016) 0.619 0.586 0.652
## 9 Detroit, MI (2016) Madison, WI (2016) 0.619 0.639 0.598
## 10 Chicago, IL (2016) Grand Rapids, MI (20~ 0.621 0.647 0.595
allAccuracy_month <- select(acc_TempF_DewF_month,
locale=locale1,
other=locale2,
accOverall,
accLocale=accLocale1
) %>%
bind_rows(select(acc_TempF_DewF_month,
locale=locale2,
other=locale1,
accOverall,
accLocale=accLocale2
)
)
# Overall accuracy by location plot
allAccuracy_month %>%
group_by(locale) %>%
summarize_if(is.numeric, mean) %>%
ggplot(aes(x=fct_reorder(locale, accOverall), y=accOverall)) +
geom_point(size=2) +
geom_text(aes(y=accOverall+0.02, label=paste0(round(100*accOverall), "%"))) +
coord_flip() +
labs(x="",
y="Average Accuracy",
title="Average Accuracy Predicting Locale",
subtitle="Predictions made 1:1 to each other locale (average accuracy reported)",
caption="Temperature, Dew Point, Month as predictors\n(50% is baseline coinflip accuracy)"
) +
ylim(c(0.5, 1))
# Overall accuracy heatmap (FIX for better ordering of locales)
# allAccuracy_month %>%
# ggplot(aes(x=locale, y=other)) +
# geom_tile(aes(fill=accOverall)) +
# theme(axis.text.x=element_text(angle=90)) +
# scale_fill_continuous("Accuracy", high="darkblue", low="white") +
# labs(title="Accuracy Predicting Locale vs. Locale",
# caption="Temperature, Dew Point, and Month as predictors\n(50% is baseline coinflip accuracy)",
# x="",
# y=""
# )
Accuracy is best for cities with significantly different climates. The model is especially successful at pulling apart the desert cities (Las Vegas, Phoenix) from everything else, and the humid Florida cities (Miami, Tampa Bay) from everything else.
Next, the simple model is run to classify locale across the full 2016 dataset:
# Run random forest for all 2016 locales
rf_all_2016_TDm <- rfMultiLocale(metarData,
vrbls=c("TempF", "DewF", "month"),
locs=names_2016,
ntree=50,
seed=2006281508
)
Summaries can then be created for the accuracy in predicting each locale:
evalPredictions(rf_all_2016_TDm, plotCaption = "Temp, Dew Point, Month")
## # A tibble: 898 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 Atlanta, GA (2016) Atlanta, GA (2016) TRUE 484 0.182
## 2 Atlanta, GA (2016) Boston, MA (2016) FALSE 49 0.0185
## 3 Atlanta, GA (2016) Chicago, IL (2016) FALSE 39 0.0147
## 4 Atlanta, GA (2016) Dallas, TX (2016) FALSE 179 0.0675
## 5 Atlanta, GA (2016) Denver, CO (2016) FALSE 34 0.0128
## 6 Atlanta, GA (2016) Detroit, MI (2016) FALSE 42 0.0158
## 7 Atlanta, GA (2016) Grand Rapids, MI (2016) FALSE 48 0.0181
## 8 Atlanta, GA (2016) Green Bay, WI (2016) FALSE 42 0.0158
## 9 Atlanta, GA (2016) Houston, TX (2016) FALSE 96 0.0362
## 10 Atlanta, GA (2016) Indianapolis, IN (2016) FALSE 49 0.0185
## # ... with 888 more rows
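evalPredictions() is sourced from ‘WeatherModelingFunctions_v001.R’ and also produces accuracy plots. Its core tabulation (the locale/predicted/correct/n/pct table above) might be sketched as follows; this is an illustration, not the sourced implementation:

```r
# Hypothetical sketch of the tabulation inside evalPredictions(): count each
# (actual, predicted) pair and express counts as shares of the actual locale.
tallyPredictions <- function(actual, predicted) {
  tab <- as.data.frame(table(locale=actual, predicted=predicted))
  tab <- tab[tab$Freq > 0, ]
  tab$correct <- as.character(tab$locale) == as.character(tab$predicted)
  tab$pct <- tab$Freq / ave(tab$Freq, tab$locale, FUN=sum)
  names(tab)[names(tab) == "Freq"] <- "n"
  tab[order(tab$locale, tab$predicted), c("locale", "predicted", "correct", "n", "pct")]
}
```

By construction, pct sums to 1 within each actual locale, so the TRUE row is that locale's accuracy.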
Accuracy is meaningfully increased relative to the null accuracy, though it remains under 50% for every locale. This suggests that the goal will be to define some archetype climates (e.g., humid, cold, desert) and to use those for prediction.
Clouds (minimum levels and ceiling heights) can potentially help further differentiate the cold-weather cities on the downwind sides of the Great Lakes, as well as help pull apart various marine and mid-continental climates.
An updated random forest model is then run:
# Run random forest for all 2016 locales
rf_all_2016_TDmc <- rfMultiLocale(metarData,
vrbls=c("TempF", "DewF", "month", "minHeight", "ceilingHeight"),
locs=names_2016,
ntree=50,
seed=2006281509
)
The evaluation process can again be run:
evalPredictions(rf_all_2016_TDmc, plotCaption = "Temp, Dew Point, Month, Cloud/Ceiling Height")
## # A tibble: 897 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 Atlanta, GA (2016) Atlanta, GA (2016) TRUE 643 0.240
## 2 Atlanta, GA (2016) Boston, MA (2016) FALSE 84 0.0313
## 3 Atlanta, GA (2016) Chicago, IL (2016) FALSE 61 0.0227
## 4 Atlanta, GA (2016) Dallas, TX (2016) FALSE 145 0.0541
## 5 Atlanta, GA (2016) Denver, CO (2016) FALSE 44 0.0164
## 6 Atlanta, GA (2016) Detroit, MI (2016) FALSE 51 0.0190
## 7 Atlanta, GA (2016) Grand Rapids, MI (2016) FALSE 49 0.0183
## 8 Atlanta, GA (2016) Green Bay, WI (2016) FALSE 26 0.00969
## 9 Atlanta, GA (2016) Houston, TX (2016) FALSE 82 0.0306
## 10 Atlanta, GA (2016) Indianapolis, IN (2016) FALSE 100 0.0373
## # ... with 887 more rows
# Get the comparison to previous accuracy
deltaAccuracy(prevObject=rf_all_2016_TDm, currObject=rf_all_2016_TDmc)
Accuracy increases meaningfully, though still well under 50% in most cases.
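deltaAccuracy() is also sourced, and its tabular output is not echoed above. Assuming each model object can be reduced to a named per-locale accuracy vector (a hypothetical interface; the sourced function works directly on the model objects), the comparison step might look like:

```r
# Hypothetical sketch of the comparison inside deltaAccuracy(): line up
# per-locale accuracies from a previous and a current model and report deltas.
deltaAccuracySketch <- function(prevAcc, currAcc) {
  common <- intersect(names(prevAcc), names(currAcc))
  data.frame(locale=common,
             prevAcc=as.numeric(prevAcc[common]),
             currAcc=as.numeric(currAcc[common]),
             delta=as.numeric(currAcc[common] - prevAcc[common]),
             stringsAsFactors=FALSE
  )
}
```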
The model can also be built out to consider wind speed and wind direction. No attempt yet is made to control for over-fitting:
# Run random forest for all 2016 locales
rf_all_2016_TDmcw <- rfMultiLocale(metarData,
vrbls=c("TempF", "DewF",
"month",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir"
),
locs=names_2016,
ntree=50,
seed=2006281934
)
The evaluation process can again be run:
# Get the graphs of the prediction accuracy
evalPredictions(rf_all_2016_TDmcw,
plotCaption = "Temp, Dew Point, Month, Cloud/Ceiling Height, Wind Speed/Direction"
)
## # A tibble: 889 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 Atlanta, GA (2016) Atlanta, GA (2016) TRUE 1096 0.415
## 2 Atlanta, GA (2016) Boston, MA (2016) FALSE 34 0.0129
## 3 Atlanta, GA (2016) Chicago, IL (2016) FALSE 46 0.0174
## 4 Atlanta, GA (2016) Dallas, TX (2016) FALSE 135 0.0511
## 5 Atlanta, GA (2016) Denver, CO (2016) FALSE 28 0.0106
## 6 Atlanta, GA (2016) Detroit, MI (2016) FALSE 31 0.0117
## 7 Atlanta, GA (2016) Grand Rapids, MI (2016) FALSE 26 0.00984
## 8 Atlanta, GA (2016) Green Bay, WI (2016) FALSE 24 0.00908
## 9 Atlanta, GA (2016) Houston, TX (2016) FALSE 58 0.0219
## 10 Atlanta, GA (2016) Indianapolis, IN (2016) FALSE 59 0.0223
## # ... with 879 more rows
# Get the comparison to previous accuracy
deltaAccuracy(prevObject=rf_all_2016_TDmc, currObject=rf_all_2016_TDmcw)
Including wind significantly improves model accuracy for many locales. Even the cold-weather cities are now being predicted with around 30%-40% accuracy.
The model can also be built out to consider sea-level pressure. No attempt yet is made to control for over-fitting, with the exception that mtry is held at 4:
# Run random forest for all 2016 locales
rf_all_2016_TDmcwa <- rfMultiLocale(metarData,
vrbls=c("TempF", "DewF",
"month",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir",
"modSLP"
),
locs=names_2016,
ntree=50,
mtry=4,
seed=2006282103
)
The evaluation process can again be run:
# Get the graphs of the prediction accuracy
evalPredictions(rf_all_2016_TDmcwa,
plotCaption = "Temp, Dew Point, Month, Cloud/Ceiling Height, Wind Speed/Direction, SLP"
)
## # A tibble: 880 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 Atlanta, GA (2016) Atlanta, GA (2016) TRUE 1639 0.625
## 2 Atlanta, GA (2016) Boston, MA (2016) FALSE 16 0.00610
## 3 Atlanta, GA (2016) Chicago, IL (2016) FALSE 23 0.00877
## 4 Atlanta, GA (2016) Dallas, TX (2016) FALSE 75 0.0286
## 5 Atlanta, GA (2016) Denver, CO (2016) FALSE 8 0.00305
## 6 Atlanta, GA (2016) Detroit, MI (2016) FALSE 18 0.00686
## 7 Atlanta, GA (2016) Grand Rapids, MI (2016) FALSE 12 0.00457
## 8 Atlanta, GA (2016) Green Bay, WI (2016) FALSE 6 0.00229
## 9 Atlanta, GA (2016) Houston, TX (2016) FALSE 45 0.0172
## 10 Atlanta, GA (2016) Indianapolis, IN (2016) FALSE 51 0.0194
## # ... with 870 more rows
# Get the comparison to previous accuracy
deltaAccuracy(prevObject=rf_all_2016_TDmcw, currObject=rf_all_2016_TDmcwa)
Adding sea-level pressure again significantly improves prediction ability. The cold-weather cities are in roughly the 40%-60% accuracy range, and the other cities are in roughly the 60%-80% range.
Based on the hierarchical clustering and the results of the initial random forest models, a few archetype cities are defined in an attempt to pull out distinctions.
A data subset is created, with “hr” added for the Zulu hour of the observation (as integer rather than as factor):
archeCities <- c("kmia_2016", "klas_2016", "klax_2016", "kden_2016",
"ksea_2016", "kdfw_2016", "kmsp_2016", "katl_2016"
)
archeData <- metarData %>%
filter(source %in% archeCities) %>%
mutate(hr=lubridate::hour(dtime))
A random forest, with mtry=4, is then run using all of the previous variables plus hour:
# Run random forest for all 2016 locales
rf_arche_2016_TDmcwa <- rfMultiLocale(archeData,
vrbls=c("TempF", "DewF",
"month", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir",
"modSLP"
),
locs=NULL, # defaults to running for all locales
ntree=100,
mtry=4,
seed=2006291353
)
##
## Running for locations:
## [1] "katl_2016" "kden_2016" "kdfw_2016" "klas_2016" "klax_2016" "kmia_2016"
## [7] "kmsp_2016" "ksea_2016"
The evaluation process can again be run:
# Get the graphs of the prediction accuracy
evalPredictions(rf_arche_2016_TDmcwa,
plotCaption = "Temp, Dew Point, Month/Hour, Cloud/Ceiling Height, Wind Speed/Direction, SLP"
)
## # A tibble: 62 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 Atlanta, GA (2016) Atlanta, GA (2016) TRUE 2204 0.835
## 2 Atlanta, GA (2016) Dallas, TX (2016) FALSE 139 0.0527
## 3 Atlanta, GA (2016) Denver, CO (2016) FALSE 39 0.0148
## 4 Atlanta, GA (2016) Las Vegas, NV (2016) FALSE 33 0.0125
## 5 Atlanta, GA (2016) Los Angeles, CA (2016) FALSE 86 0.0326
## 6 Atlanta, GA (2016) Miami, FL (2016) FALSE 50 0.0189
## 7 Atlanta, GA (2016) Minneapolis, MN (2016) FALSE 45 0.0170
## 8 Atlanta, GA (2016) Seattle, WA (2016) FALSE 44 0.0167
## 9 Dallas, TX (2016) Atlanta, GA (2016) FALSE 203 0.0789
## 10 Dallas, TX (2016) Dallas, TX (2016) TRUE 2105 0.818
## # ... with 52 more rows
The model is reasonably successful in pulling apart the archetypes. Denver and Dallas seem potentially less distinct, though all archetype cities are predicted at 80%+ accuracy.
The remaining cities can be classified against the archetype data, to explore any patterns that emerge:
localeByArchetype(rf_arche_2016_TDmcwa,
fullData=metarData,
archeCities=archeCities,
sortDescMatch=TRUE
)
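localeByArchetype() is sourced; conceptually, it predicts each remaining city’s observations with the archetype model and reports the share of predictions that land on each archetype. The share computation alone (a sketch; the prediction and plotting steps are omitted) could be:

```r
# Hypothetical sketch of the share computation inside localeByArchetype():
# given each observation's city and predicted archetype, return row-normalized
# shares per city. Illustrative only.
archetypeShares <- function(city, predictedArchetype) {
  round(prop.table(table(city=city, archetype=predictedArchetype), margin=1), 3)
}
```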
There appear to be several issues. The first is addressed by collapsing Atlanta, Dallas, and Miami and instead using Houston as the archetype, and by replacing Minneapolis with Chicago:
archeCities_v02 <- c("kiah_2016", "klas_2016", "klax_2016",
"kden_2016", "ksea_2016", "kord_2016"
)
archeData_v02 <- metarData %>%
filter(source %in% archeCities_v02) %>%
mutate(hr=lubridate::hour(dtime))
A random forest, with mtry=4, is then run using all of the previous variables plus hour:
# Run random forest for all 2016 locales in archeData_v02
rf_arche_2016_TDmcwa_v02 <- rfMultiLocale(archeData_v02,
vrbls=c("TempF", "DewF",
"month", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir",
"modSLP"
),
locs=NULL, # defaults to running for all locales
ntree=100,
mtry=4,
seed=2006291448
)
##
## Running for locations:
## [1] "kden_2016" "kiah_2016" "klas_2016" "klax_2016" "kord_2016" "ksea_2016"
The evaluation process can again be run:
# Get the graphs of the prediction accuracy
evalPredictions(rf_arche_2016_TDmcwa_v02,
plotCaption = "Temp, Dew Point, Month/Hour, Cloud/Ceiling Height, Wind Speed/Direction, SLP"
)
## # A tibble: 36 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 Chicago, IL (2016) Chicago, IL (2016) TRUE 2321 0.872
## 2 Chicago, IL (2016) Denver, CO (2016) FALSE 99 0.0372
## 3 Chicago, IL (2016) Houston, TX (2016) FALSE 54 0.0203
## 4 Chicago, IL (2016) Las Vegas, NV (2016) FALSE 13 0.00489
## 5 Chicago, IL (2016) Los Angeles, CA (2016) FALSE 70 0.0263
## 6 Chicago, IL (2016) Seattle, WA (2016) FALSE 104 0.0391
## 7 Denver, CO (2016) Chicago, IL (2016) FALSE 119 0.0453
## 8 Denver, CO (2016) Denver, CO (2016) TRUE 2282 0.869
## 9 Denver, CO (2016) Houston, TX (2016) FALSE 3 0.00114
## 10 Denver, CO (2016) Las Vegas, NV (2016) FALSE 126 0.0480
## # ... with 26 more rows
The model is reasonably successful in pulling apart the archetypes. Houston appears the most distinct (as had Miami previously), and there is less overlap going to/from Denver or Seattle.
The remaining cities can be classified against the archetype data, to explore any patterns that emerge:
localeByArchetype(rf_arche_2016_TDmcwa_v02,
fullData=metarData,
archeCities=archeCities_v02,
sortDescMatch = TRUE
)
At a glance, the archetypes are more sensible. There is a bit of Denver and a bit of Seattle in most of the cold-weather cities, and a bit of Houston in some of the warmer cold-weather cities. This suggests either that there is more than one type of cold-weather city, or that cold-weather cities tend to resemble other archetypes at certain times.
The model is run again, dropping Denver and Seattle and replacing Los Angeles with San Diego:
archeCities_v03 <- c("kiah_2016", "klas_2016", "ksan_2016", "kord_2016")
archeData_v03 <- metarData %>%
filter(source %in% archeCities_v03) %>%
mutate(hr=lubridate::hour(dtime))
A random forest, with mtry=4, is then run using all of the previous variables plus hour:
# Run random forest for all 2016 locales in archeData_v03
rf_arche_2016_TDmcwa_v03 <- rfMultiLocale(archeData_v03,
vrbls=c("TempF", "DewF",
"month", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir",
"modSLP"
),
locs=NULL, # defaults to running for all locales
ntree=100,
mtry=4,
seed=2006291459
)
##
## Running for locations:
## [1] "kiah_2016" "klas_2016" "kord_2016" "ksan_2016"
The evaluation process can again be run:
# Get the graphs of the prediction accuracy
evalPredictions(rf_arche_2016_TDmcwa_v03,
plotCaption = "Temp, Dew Point, Month/Hour, Cloud/Ceiling Height, Wind Speed/Direction, SLP"
)
## Warning: Removed 1 rows containing missing values (geom_text).
## # A tibble: 16 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 Chicago, IL (2016) Chicago, IL (2016) TRUE 2511 0.946
## 2 Chicago, IL (2016) Houston, TX (2016) FALSE 69 0.0260
## 3 Chicago, IL (2016) Las Vegas, NV (2016) FALSE 26 0.00979
## 4 Chicago, IL (2016) San Diego, CA (2016) FALSE 49 0.0185
## 5 Houston, TX (2016) Chicago, IL (2016) FALSE 44 0.0170
## 6 Houston, TX (2016) Houston, TX (2016) TRUE 2473 0.954
## 7 Houston, TX (2016) Las Vegas, NV (2016) FALSE 19 0.00733
## 8 Houston, TX (2016) San Diego, CA (2016) FALSE 57 0.0220
## 9 Las Vegas, NV (2016) Chicago, IL (2016) FALSE 29 0.0113
## 10 Las Vegas, NV (2016) Houston, TX (2016) FALSE 13 0.00508
## 11 Las Vegas, NV (2016) Las Vegas, NV (2016) TRUE 2482 0.971
## 12 Las Vegas, NV (2016) San Diego, CA (2016) FALSE 33 0.0129
## 13 San Diego, CA (2016) Chicago, IL (2016) FALSE 35 0.0130
## 14 San Diego, CA (2016) Houston, TX (2016) FALSE 29 0.0108
## 15 San Diego, CA (2016) Las Vegas, NV (2016) FALSE 63 0.0235
## 16 San Diego, CA (2016) San Diego, CA (2016) TRUE 2559 0.953
These archetypes are distinct, with accuracies all in the 95% range. The remaining cities can be classified against the archetype data, to explore any patterns that emerge:
localeByArchetype(rf_arche_2016_TDmcwa_v03,
fullData=metarData,
archeCities=archeCities_v03,
sortDescMatch=TRUE
)
Quite a few cities show as blends of the archetype cities, which is likely correct if archetypes are considered to be Cold, Humid, Marine, Desert. An open question is whether to expand that list with another 1-2 archetypes that attempt to better capture Atlanta, Dallas, Denver, St Louis, and DC.
The Marine archetype may need to be converted to a Pacific archetype to better identify the California cities.
A decision about Seattle is needed, as it is a little like everything and everything is a little like it.
To address the issue of the model learning from specific cities rather than archetypes, user-defined clusters will be created, and data from these clusters will be used in modeling. Further, a variable “hr” will be created for the Zulu hour of the observation:
# Create the locale mapper
locMapperTibble <- tibble::tribble(
~source, ~locType,
'katl_2016', 'South',
'kbos_2016', 'Exclude',
'kdca_2016', 'Exclude',
'kden_2016', 'Exclude',
'kdfw_2016', 'South',
'kdtw_2016', 'Cold',
'kewr_2016', 'Exclude',
'kgrb_2016', 'Cold',
'kgrr_2016', 'Cold',
'kiah_2016', 'Humid',
'kind_2016', 'Mixed',
'klas_2016', 'Desert',
'klax_2016', 'Marine',
'klnk_2016', 'Mixed',
'kmia_2016', 'Humid',
'kmke_2016', 'Cold',
'kmsn_2016', 'Cold',
'kmsp_2016', 'Cold',
'kmsy_2016', 'Humid',
'kord_2016', 'Cold',
'kphl_2016', 'Exclude',
'kphx_2016', 'Desert',
'ksan_2016', 'Marine',
'ksat_2016', 'South',
'ksea_2016', 'Exclude',
'ksfo_2016', 'Exclude',
'ksjc_2016', 'Exclude',
'kstl_2016', 'Mixed',
'ktpa_2016', 'Humid',
'ktvc_2016', 'Cold'
)
# Create locMapper
locMapper <- locMapperTibble %>% pull(locType)
names(locMapper) <- locMapperTibble %>% pull(source)
locMapper
## katl_2016 kbos_2016 kdca_2016 kden_2016 kdfw_2016 kdtw_2016 kewr_2016 kgrb_2016
## "South" "Exclude" "Exclude" "Exclude" "South" "Cold" "Exclude" "Cold"
## kgrr_2016 kiah_2016 kind_2016 klas_2016 klax_2016 klnk_2016 kmia_2016 kmke_2016
## "Cold" "Humid" "Mixed" "Desert" "Marine" "Mixed" "Humid" "Cold"
## kmsn_2016 kmsp_2016 kmsy_2016 kord_2016 kphl_2016 kphx_2016 ksan_2016 ksat_2016
## "Cold" "Cold" "Humid" "Cold" "Exclude" "Desert" "Marine" "South"
## ksea_2016 ksfo_2016 ksjc_2016 kstl_2016 ktpa_2016 ktvc_2016
## "Exclude" "Exclude" "Exclude" "Mixed" "Humid" "Cold"
# Create the data file with locType
locData_v04 <- metarData %>%
mutate(locType=locMapper[source], hr=lubridate::hour(dtime)) %>%
filter(year==2016, locType !="Exclude")
# Counts by locType
locData_v04 %>%
count(locType)
## # A tibble: 6 x 2
## locType n
## <chr> <int>
## 1 Cold 70101
## 2 Desert 17543
## 3 Humid 35074
## 4 Marine 17536
## 5 Mixed 26187
## 6 South 26309
There is a risk that the model will predict neutral cities as ‘Cold’ given the 4:1 ratio of Cold cities to Marine/Desert cities. The random forest model is first run on the data ‘as is’:
rf_arche_small_TDmcwa_v04 <- rfMultiLocale(locData_v04,
vrbls=c("TempF", "DewF",
"month", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir",
"modSLP"
),
locs=NULL, # defaults to running for all locales
ntree=100,
locVar="locType",
pred="locType",
mtry=4,
otherVar=c("dtime", "source", "locale"),
seed=2006301313
)
##
## Running for locations:
## [1] "Cold" "Desert" "Humid" "Marine" "Mixed" "South"
Prediction accuracy is then assessed:
# Get the graphs of the prediction accuracy
evalPredictions(rf_arche_small_TDmcwa_v04,
plotCaption = "Temp, Dew Point, Month/Hour, Cloud/Ceiling Height, Wind Speed/Direction, SLP",
keyVar="locType"
)
## # A tibble: 36 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 Cold Cold TRUE 19819 0.947
## 2 Cold Desert FALSE 37 0.00177
## 3 Cold Humid FALSE 119 0.00568
## 4 Cold Marine FALSE 159 0.00759
## 5 Cold Mixed FALSE 549 0.0262
## 6 Cold South FALSE 254 0.0121
## 7 Desert Cold FALSE 122 0.0234
## 8 Desert Desert TRUE 4790 0.921
## 9 Desert Humid FALSE 41 0.00788
## 10 Desert Marine FALSE 115 0.0221
## # ... with 26 more rows
The designation of Cold, Desert, Humid, and Marine seems successful. The attempt to split South from Humid resulted in good, if incomplete, separation. The attempt to separate a group of ‘warmer’ cold-weather cities from ‘colder’ cold-weather cities was not particularly successful. This could partially be an artifact of ‘Cold’ having roughly double the data volume of ‘Mixed’.
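One standard mitigation for this kind of imbalance is to downsample each locType to the size of the smallest class before fitting. A base-R sketch (hypothetical; not the project’s actual filtering code):

```r
# Hypothetical downsampling sketch: sample an equal number of rows per locType
# so no class dominates training. Illustrative only.
downsampleByType <- function(df, typeVar="locType", seed=123) {
  set.seed(seed)
  nMin <- min(table(df[[typeVar]]))
  idx <- unlist(lapply(split(seq_len(nrow(df)), df[[typeVar]]),
                       function(i) i[sample.int(length(i), nMin)]))
  df[sort(idx), ]
}
```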
Classifications by city can also be examined:
archeCities_v04 <- locMapperTibble %>%
filter(locType != "Exclude") %>%
pull(source)
localeByArchetype(rf_arche_small_TDmcwa_v04,
fullData=mutate(metarData, locType=locMapper[source]),
archeCities=archeCities_v04,
sortDescMatch=TRUE
)
Many of the cities are nicely classified into their assigned archetypes, as desired. Among the cities used in the classifications, concerns include:
Classifications are changed somewhat, and data are then filtered so that an equal number of observations from each locale type are applied in the modeling:
# Create the locale mapper
locMapperTibble_v05 <- tibble::tribble(
~source, ~locType,
'katl_2016', 'Exclude',
'kbos_2016', 'Exclude',
'kdca_2016', 'Exclude',
'kden_2016', 'Exclude',
'kdfw_2016', 'Exclude',
'kdtw_2016', 'Cold',
'kewr_2016', 'Exclude',
'kgrb_2016', 'Cold',
'kgrr_2016', 'Cold',
'kiah_2016', 'Humid',
'kind_2016', 'Exclude',
'klas_2016', 'Desert',
'klax_2016', 'Marine',
'klnk_2016', 'Exclude',
'kmia_2016', 'Humid',
'kmke_2016', 'Cold',
'kmsn_2016', 'Cold',
'kmsp_2016', 'Cold',
'kmsy_2016', 'Humid',
'kord_2016', 'Cold',
'kphl_2016', 'Exclude',
'kphx_2016', 'Desert',
'ksan_2016', 'Marine',
'ksat_2016', 'Exclude',
'ksea_2016', 'Exclude',
'ksfo_2016', 'Exclude',
'ksjc_2016', 'Exclude',
'kstl_2016', 'Exclude',
'ktpa_2016', 'Humid',
'ktvc_2016', 'Cold'
)
# Create locMapper
locMapper_v05 <- locMapperTibble_v05 %>% pull(locType)
names(locMapper_v05) <- locMapperTibble_v05 %>% pull(source)
locMapper_v05
## katl_2016 kbos_2016 kdca_2016 kden_2016 kdfw_2016 kdtw_2016 kewr_2016 kgrb_2016
## "Exclude" "Exclude" "Exclude" "Exclude" "Exclude" "Cold" "Exclude" "Cold"
## kgrr_2016 kiah_2016 kind_2016 klas_2016 klax_2016 klnk_2016 kmia_2016 kmke_2016
## "Cold" "Humid" "Exclude" "Desert" "Marine" "Exclude" "Humid" "Cold"
## kmsn_2016 kmsp_2016 kmsy_2016 kord_2016 kphl_2016 kphx_2016 ksan_2016 ksat_2016
## "Cold" "Cold" "Humid" "Cold" "Exclude" "Desert" "Marine" "Exclude"
## ksea_2016 ksfo_2016 ksjc_2016 kstl_2016 ktpa_2016 ktvc_2016
## "Exclude" "Exclude" "Exclude" "Exclude" "Humid" "Cold"
# Create the data file with locType
locData_v05 <- metarData %>%
mutate(locType=locMapper_v05[source], hr=lubridate::hour(dtime)) %>%
filter(year==2016, locType !="Exclude")
# Counts by locType
locData_v05 %>%
count(locType)
## # A tibble: 4 x 2
## locType n
## <chr> <int>
## 1 Cold 70101
## 2 Desert 17543
## 3 Humid 35074
## 4 Marine 17536
The data are then filtered so that an equal number of observations is drawn from each locale type. The model can later be evaluated against the remaining, held-out observations:
# Set a seed for reproducibility
set.seed(2006301356)
# Find the smallest locale type
nSmall <- locData_v05 %>%
filter(!is.na(locType)) %>%
count(locType) %>%
pull(n) %>%
min()
# Create the relevant data subset
subData_v05 <- locData_v05 %>%
filter(!is.na(locType)) %>%
group_by(locType) %>%
sample_n(size=nSmall, replace=FALSE) %>%
ungroup()
# Summarize the data subset
subData_v05 %>%
count(locType, locale) %>%
arrange(-n)
## # A tibble: 16 x 3
## locType locale n
## <chr> <chr> <int>
## 1 Marine Los Angeles, CA (2016) 8774
## 2 Desert Phoenix, AZ (2016) 8770
## 3 Desert Las Vegas, NV (2016) 8766
## 4 Marine San Diego, CA (2016) 8762
## 5 Humid New Orleans, LA (2016) 4402
## 6 Humid Miami, FL (2016) 4388
## 7 Humid Houston, TX (2016) 4386
## 8 Humid Tampa Bay, FL (2016) 4360
## 9 Cold Grand Rapids, MI (2016) 2234
## 10 Cold Milwaukee, WI (2016) 2231
## 11 Cold Minneapolis, MN (2016) 2205
## 12 Cold Madison, WI (2016) 2188
## 13 Cold Traverse City, MI (2016) 2182
## 14 Cold Detroit, MI (2016) 2173
## 15 Cold Chicago, IL (2016) 2168
## 16 Cold Green Bay, WI (2016) 2155
The previous model can then be run on the data subset:
rf_arche_small_TDmcwa_v05 <- rfMultiLocale(subData_v05,
vrbls=c("TempF", "DewF",
"month", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir",
"modSLP"
),
locs=NULL, # defaults to running for all locales
ntree=100,
locVar="locType",
pred="locType",
mtry=4,
otherVar=c("dtime", "source", "locale"),
seed=2006301358
)
##
## Running for locations:
## [1] "Cold" "Desert" "Humid" "Marine"
The evaluation process can again be run:
# Get the graphs of the prediction accuracy
evalPredictions(rf_arche_small_TDmcwa_v05,
plotCaption = "Temp, Dew Point, Month/Hour, Cloud/Ceiling Height, Wind Speed/Direction, SLP",
keyVar="locType"
)
## # A tibble: 16 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 Cold Cold TRUE 4864 0.931
## 2 Cold Desert FALSE 68 0.0130
## 3 Cold Humid FALSE 116 0.0222
## 4 Cold Marine FALSE 175 0.0335
## 5 Desert Cold FALSE 58 0.0112
## 6 Desert Desert TRUE 4925 0.955
## 7 Desert Humid FALSE 34 0.00659
## 8 Desert Marine FALSE 141 0.0273
## 9 Humid Cold FALSE 99 0.0188
## 10 Humid Desert FALSE 64 0.0121
## 11 Humid Humid TRUE 4916 0.932
## 12 Humid Marine FALSE 194 0.0368
## 13 Marine Cold FALSE 88 0.0166
## 14 Marine Desert FALSE 210 0.0396
## 15 Marine Humid FALSE 81 0.0153
## 16 Marine Marine TRUE 4920 0.928
The archetypes are well-separated, with roughly 93%-95% accuracy in every case. Predictions can then also be made for all cities, including those not in the modeling:
archeCities_v05 <- locMapperTibble_v05 %>%
filter(locType != "Exclude") %>%
pull(source)
localeByArchetype(rf_arche_small_TDmcwa_v05,
fullData=mutate(metarData, locType=locMapper_v05[source]),
archeCities=archeCities_v05,
sortDescMatch=TRUE
)
Broadly speaking, cities used in modeling appear to classify into the appropriate archetypes. For the cities not included in the modeling:
The model can be applied to data from years other than 2016, to see how the classifications are impacted by use in out-of-sample years:
# Predictions on non-2016 data
helperPredictPlot(rf_arche_small_TDmcwa_v05$rfModel,
df=filter(mutate(metarData, hr=lubridate::hour(dtime)), year!=2016),
predOrder=c("Marine", "Humid", "Desert", "Cold"),
locMapper=locMapper_v05
)
## # A tibble: 80 x 4
## locale predicted n pct
## <chr> <fct> <int> <dbl>
## 1 Chicago, IL (2014) Cold 7928 0.915
## 2 Chicago, IL (2014) Desert 273 0.0315
## 3 Chicago, IL (2014) Humid 144 0.0166
## 4 Chicago, IL (2014) Marine 318 0.0367
## 5 Chicago, IL (2015) Cold 7722 0.886
## 6 Chicago, IL (2015) Desert 247 0.0283
## 7 Chicago, IL (2015) Humid 340 0.0390
## 8 Chicago, IL (2015) Marine 411 0.0471
## 9 Chicago, IL (2017) Cold 7660 0.878
## 10 Chicago, IL (2017) Desert 466 0.0534
## # ... with 70 more rows
Model performance on non-2016 data is not as strong, with roughly a 10%-20% loss of accuracy. Predictions are still much better than null accuracy, and the model (mostly) continues to separate the archetypes.
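For reference, ‘null accuracy’ here means the accuracy of always guessing the most common class. A quick sketch using the unbalanced 2016 counts from earlier (70101 Cold, 17543 Desert, 35074 Humid, 17536 Marine):

```r
# Class counts from the unbalanced 2016 data (locData_v05)
counts <- c(Cold = 70101, Desert = 17543, Humid = 35074, Marine = 17536)
# Null accuracy: always predict the majority class
null_acc <- max(counts) / sum(counts)
null_acc  # ~0.50 on the unbalanced data; 0.25 after balancing to equal sizes
```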
Suppose that models are run on all 2015-2018 data for Chicago, Las Vegas, New Orleans, and San Diego:
# Create the subset for Chicago, Las Vegas, New Orleans, San Diego
sub_2015_2018_data <- metarData %>%
filter(str_sub(source, 1, 4) %in% c("kord", "klas", "kmsy", "ksan"),
year %in% c(2015, 2016, 2017, 2018)
) %>%
mutate(city=str_replace(locale, pattern=" .\\d{4}.", replacement=""),
hr=lubridate::hour(dtime)
)
# Check that proper locales are included
sub_2015_2018_data %>%
count(city, locale)
## # A tibble: 16 x 3
## city locale n
## <chr> <chr> <int>
## 1 Chicago, IL Chicago, IL (2015) 8728
## 2 Chicago, IL Chicago, IL (2016) 8767
## 3 Chicago, IL Chicago, IL (2017) 8740
## 4 Chicago, IL Chicago, IL (2018) 8737
## 5 Las Vegas, NV Las Vegas, NV (2015) 8727
## 6 Las Vegas, NV Las Vegas, NV (2016) 8770
## 7 Las Vegas, NV Las Vegas, NV (2017) 8664
## 8 Las Vegas, NV Las Vegas, NV (2018) 8746
## 9 New Orleans, LA New Orleans, LA (2015) 8727
## 10 New Orleans, LA New Orleans, LA (2016) 8767
## 11 New Orleans, LA New Orleans, LA (2017) 8723
## 12 New Orleans, LA New Orleans, LA (2018) 8737
## 13 San Diego, CA San Diego, CA (2015) 8737
## 14 San Diego, CA San Diego, CA (2016) 8762
## 15 San Diego, CA San Diego, CA (2017) 8744
## 16 San Diego, CA San Diego, CA (2018) 8744
The random forest model is run and cached:
# Run random forest for 2015-2018 data
rf_types_2015_2018_TDmcwha <- rfMultiLocale(sub_2015_2018_data,
vrbls=c("TempF", "DewF",
"month", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir",
"modSLP"
),
locs=NULL,
locVar="city",
pred="city",
ntree=50,
seed=2006301420,
mtry=4
)
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
evalPredictions(rf_types_2015_2018_TDmcwha,
plotCaption = "Temp, Dew Point, Month, Hour of Day, Cloud Height, Wind, SLP",
keyVar="city"
)
## Warning: Removed 1 rows containing missing values (geom_text).
## # A tibble: 16 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 Chicago, IL Chicago, IL TRUE 9847 0.941
## 2 Chicago, IL Las Vegas, NV FALSE 121 0.0116
## 3 Chicago, IL New Orleans, LA FALSE 236 0.0226
## 4 Chicago, IL San Diego, CA FALSE 258 0.0247
## 5 Las Vegas, NV Chicago, IL FALSE 153 0.0146
## 6 Las Vegas, NV Las Vegas, NV TRUE 10048 0.961
## 7 Las Vegas, NV New Orleans, LA FALSE 60 0.00574
## 8 Las Vegas, NV San Diego, CA FALSE 192 0.0184
## 9 New Orleans, LA Chicago, IL FALSE 224 0.0213
## 10 New Orleans, LA Las Vegas, NV FALSE 98 0.00932
## 11 New Orleans, LA New Orleans, LA TRUE 9859 0.938
## 12 New Orleans, LA San Diego, CA FALSE 329 0.0313
## 13 San Diego, CA Chicago, IL FALSE 173 0.0166
## 14 San Diego, CA Las Vegas, NV FALSE 281 0.0269
## 15 San Diego, CA New Orleans, LA FALSE 197 0.0189
## 16 San Diego, CA San Diego, CA TRUE 9789 0.938
Even with a small forest (50 trees), the model almost always separates Las Vegas, Chicago, San Diego, and New Orleans correctly. Although the climates of these cities are very different, it is still striking how few misclassifications the model makes.
How do other cities map against these classifications?
# Predictions on 2015/2018 data
helperPredictPlot(rf_types_2015_2018_TDmcwha$rfModel,
df=filter(mutate(metarData, hr=lubridate::hour(dtime)),
!(str_sub(source, 1, 4) %in% c("kord", "klas", "kmsy", "ksan"))
),
predOrder=c("Chicago, IL", "San Diego, CA", "New Orleans, LA", "Las Vegas, NV")
)
## # A tibble: 104 x 4
## locale predicted n pct
## <chr> <fct> <int> <dbl>
## 1 Atlanta, GA (2016) Chicago, IL 3363 0.384
## 2 Atlanta, GA (2016) Las Vegas, NV 809 0.0924
## 3 Atlanta, GA (2016) New Orleans, LA 3709 0.424
## 4 Atlanta, GA (2016) San Diego, CA 870 0.0994
## 5 Boston, MA (2016) Chicago, IL 7710 0.891
## 6 Boston, MA (2016) Las Vegas, NV 430 0.0497
## 7 Boston, MA (2016) New Orleans, LA 291 0.0336
## 8 Boston, MA (2016) San Diego, CA 223 0.0258
## 9 Dallas, TX (2016) Chicago, IL 2999 0.343
## 10 Dallas, TX (2016) Las Vegas, NV 1147 0.131
## # ... with 94 more rows
Classifications are broadly as expected based on the previous archetype analysis. Variable importances are plotted:
helperPlotVarImp(rf_types_2015_2018_TDmcwha$rfModel)
Dew point, temperature, and month continue to be strong factors for separating the four cities in this analysis. SLP, minimum cloud height, and prevailing wind direction are also meaningful.
An assessment can be run for the 2015-2018 model:
# Run for the full model including SLP
probs_2015_2018_TDmcwha <-
assessPredictionCertainty(rf_types_2015_2018_TDmcwha,
keyVar="city",
plotCaption="Temp, Dew Point, Month/Hour, Clouds, Wind, SLP",
showAcc=TRUE
)
## Note: Using an external vector in selections is ambiguous.
## i Use `all_of(keyVar)` instead of `keyVar` to silence this message.
## i See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
## Note: Using an external vector in selections is ambiguous.
## i Use `all_of(pkVars)` instead of `pkVars` to silence this message.
## i See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
A similar process can be run for assessing the classification of the other cities against the 2015-2018 data for Chicago, Las Vegas, New Orleans, and San Diego:
useData <- metarData %>%
filter(!(str_sub(source, 1, 4) %in% c("kord", "klas", "kmsy", "ksan"))) %>%
mutate(hr=lubridate::hour(dtime))
# Run the full 2015-2018 model against the other cities
probs_allcities_2015_2018_TDmcwh <-
assessPredictionCertainty(rf_types_2015_2018_TDmcwha,
testData=useData,
keyVar="locale",
plotCaption="Temp, Dew Point, Month/Hour, Clouds, Wind, modSLP",
showHists=TRUE
)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The model is often less confident when assigning an archetype to related cities, though it usually lands on the most sensible assignment.
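The certainty idea can be sketched in base R: given a matrix of class probabilities (as `predict(rfModel, newdata, type = "prob")` returns for a randomForest model), the per-row maximum probability serves as a confidence score. The probability values below are hypothetical:

```r
# Hypothetical class-probability matrix, one row per observation,
# mimicking predict(rfModel, newdata, type = "prob")
probs <- rbind(
  c(Chicago = 0.70, LasVegas = 0.10, NewOrleans = 0.15, SanDiego = 0.05),
  c(Chicago = 0.30, LasVegas = 0.25, NewOrleans = 0.28, SanDiego = 0.17)
)
# Assignment = highest-probability class; confidence = that probability
assignment <- colnames(probs)[max.col(probs)]
confidence <- apply(probs, 1, max)
data.frame(assignment, confidence)
```

The second row illustrates a low-confidence case: the model still picks a class, but with a probability barely above the alternatives.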
An assessment is run to look at the evolution of the model as it takes in the ‘next best’ variable for building out the random forest:
# Define the variables to be considered
possVars <- c("TempF", "DewF", "month", "hr", "minHeight",
"ceilingHeight", "WindSpeed", "predomDir", "modSLP"
)
# Create a container for storing the best random forest and variable added from each run
rfContainer <- vector("list", length=length(possVars))
rfVarAdds <- vector("character", length=length(possVars))
# Run the functions using a for loop over the length of possVars
for (ctr in 1:length(possVars)) {
# Pull in the results of the previous run
if (ctr==1) {
prevVars <- c()
} else {
prevVars <- rfVarAdds[1:(ctr-1)]
}
# Run each of them through the combinations
tmpList <- helperRFCombinations(possVars, df=sub_2015_2018_data, prevVars=prevVars)
# Assess the performance
tmpAccuracy <- helperVariableAccuracy(tmpList, possVars=possVars, prevVars=prevVars)
# Prepare to repeat the process
bestRow <- tmpAccuracy %>%
filter(locale=="OOB") %>%
filter(accuracy==max(accuracy))
bestRow
# Update the rfContainer and rfVarAdds elements
rfContainer[[ctr]] <- tmpList[[as.integer(pull(bestRow, vrblNum))]]
rfVarAdds[ctr] <- pull(bestRow, vrblName)
cat("\nVariable Added:", rfVarAdds[ctr], "\n")
}
##
## Will run for: TempF DewF month hr minHeight ceilingHeight WindSpeed predomDir modSLP
## Fixed variable(s) will be:
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Variable Added: DewF
##
## Will run for: TempF month hr minHeight ceilingHeight WindSpeed predomDir modSLP
## Fixed variable(s) will be: DewF
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Variable Added: TempF
##
## Will run for: month hr minHeight ceilingHeight WindSpeed predomDir modSLP
## Fixed variable(s) will be: DewF TempF
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Variable Added: month
##
## Will run for: hr minHeight ceilingHeight WindSpeed predomDir modSLP
## Fixed variable(s) will be: DewF TempF month
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Variable Added: modSLP
##
## Will run for: hr minHeight ceilingHeight WindSpeed predomDir
## Fixed variable(s) will be: DewF TempF month modSLP
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Variable Added: predomDir
##
## Will run for: hr minHeight ceilingHeight WindSpeed
## Fixed variable(s) will be: DewF TempF month modSLP predomDir
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Variable Added: minHeight
##
## Will run for: hr ceilingHeight WindSpeed
## Fixed variable(s) will be: DewF TempF month modSLP predomDir minHeight
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Variable Added: hr
##
## Will run for: ceilingHeight WindSpeed
## Fixed variable(s) will be: DewF TempF month modSLP predomDir minHeight hr
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Variable Added: WindSpeed
##
## Will run for: ceilingHeight
## Fixed variable(s) will be: DewF TempF month modSLP predomDir minHeight hr WindSpeed
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Variable Added: ceilingHeight
The evolution of accuracy can then be plotted:
# Pull the accuracy data from the variables selected
tblAccuracy <- map_dfr(rfContainer, .f=helperExtractAccuracy, .id="vrblNum") %>%
mutate(vrblName=rfVarAdds[as.integer(vrblNum)],
locale=factor(locale, levels=c("OOB", unique(locale)[unique(locale) != "OOB"])),
plotLabel=ifelse(as.integer(vrblNum)==1, vrblName, paste0("add ", vrblName))
)
# Plot the gains in accuracy, facetted by 'locale'
ggplot(tblAccuracy, aes(x=fct_reorder(plotLabel, -as.integer(vrblNum)))) +
geom_text(aes(y=accuracy, label=paste0(round(100*accuracy), "%"))) +
facet_wrap(~locale, nrow=1) +
ylim(c(0, 1)) +
geom_hline(yintercept=0.25, lty=2) +
labs(x="",
y="",
title="Evolution of accuracy as next-best variables are added",
caption="Null accuracy is 25%"
) +
coord_flip()
As seen previously, dew point, temperature, and month are significant differentiating variables, driving roughly 75% accuracy in classifying the four archetype cities.
Adding modified sea-level pressure (modSLP) bumps the accuracy by another ~10%, while adding minimum cloud height and prevailing wind direction bumps it by another ~5%.
This shows some of the power of the random forest algorithm as it is given additional variables to explore. Evolution of error can be plotted to see the impact:
oobError <- c()
for (ctr in 1:9) {
oobError <- c(oobError, rfContainer[[ctr]]$rfModel$err.rate[, "OOB"])
}
tibble::tibble(nVars=rep(1:9, each=25),
ntrees=rep(1:25, times=9),
accuracy=1-oobError
) %>%
ggplot(aes(x=ntrees, y=accuracy)) +
geom_line(aes(group=nVars, color=factor(nVars))) +
labs(x="Number of Trees",
y="OOB Accuracy",
title="Accuracy improves with more trees and more variables"
) +
scale_color_discrete("# Variables")
With a greater number of variables, there is a greater lift in accuracy as the forest grows larger.
Suppose that an attempt is made to classify the year for a single city. Example code includes:
# Create a subset of the data for only Chicago (2014-2019)
sub_kord_data <- metarData %>%
mutate(hr=lubridate::hour(dtime)) %>%
filter(str_sub(source, 1, 4) %in% "kord")
# Run random forest for Chicago 2014-2019 data
rf_kord_TDmcwha <- rfMultiLocale(sub_kord_data,
vrbls=c("TempF", "DewF",
"month", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir",
"modSLP"
),
locs=NULL,
locVar="year",
pred="year",
ntree=200,
seed=2007011309,
mtry=4
)
##
## Running for locations:
## [1] 2014 2015 2016 2017 2018 2019
Prediction accuracy and variable importance can then be assessed:
# Evaluate prediction accuracy
evalPredictions(rf_kord_TDmcwha,
plotCaption = "Temp, Dew Point, Month, Hour of Day, Cloud Height, Wind, SLP",
keyVar="year",
locOrder=TRUE
)
## # A tibble: 36 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 2014 2014 TRUE 2016 0.798
## 2 2014 2015 FALSE 136 0.0539
## 3 2014 2016 FALSE 85 0.0337
## 4 2014 2017 FALSE 101 0.04
## 5 2014 2018 FALSE 102 0.0404
## 6 2014 2019 FALSE 85 0.0337
## 7 2015 2014 FALSE 126 0.048
## 8 2015 2015 TRUE 2033 0.774
## 9 2015 2016 FALSE 145 0.0552
## 10 2015 2017 FALSE 103 0.0392
## # ... with 26 more rows
# Evaluate variable importance
helperPlotVarImp(rf_kord_TDmcwha$rfModel)
# Evaluate error evolution
errorEvolution(rf_kord_TDmcwha, useCategory="OOB", subT="Chicago (2014-2019)")
## # A tibble: 200 x 3
## ntree Category Error
## <int> <chr> <dbl>
## 1 1 OOB 0.491
## 2 2 OOB 0.488
## 3 3 OOB 0.489
## 4 4 OOB 0.476
## 5 5 OOB 0.469
## 6 6 OOB 0.454
## 7 7 OOB 0.442
## 8 8 OOB 0.428
## 9 9 OOB 0.418
## 10 10 OOB 0.408
## # ... with 190 more rows
Interestingly, the model predicts the year with 75%-80% accuracy (null accuracy ~17%), suggesting there is significant annual variation in the Chicago climate. Sea-level pressure has the largest variable importance (SLP reflects a mix of altitude, temperature, dew point, and high/low pressure systems). The first 50 trees significantly reduce the OOB error, with more modest, but still continuing, reductions afterwards.
The process is run for Las Vegas:
# Create a subset of the data for only Las Vegas (2014-2019)
sub_klas_data <- metarData %>%
mutate(hr=lubridate::hour(dtime)) %>%
filter(str_sub(source, 1, 4) %in% "klas")
# Run random forest for Las Vegas 2014-2019 data
rf_klas_TDmcwha <- rfMultiLocale(sub_klas_data,
vrbls=c("TempF", "DewF",
"month", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir",
"modSLP"
),
locs=NULL,
locVar="year",
pred="year",
ntree=200,
seed=2007011405,
mtry=4
)
##
## Running for locations:
## [1] 2014 2015 2016 2017 2018 2019
Prediction accuracy and variable importance can then be assessed:
# Evaluate prediction accuracy
evalPredictions(rf_klas_TDmcwha,
plotCaption = "Temp, Dew Point, Month, Hour of Day, Cloud Height, Wind, SLP",
keyVar="year",
locOrder=TRUE
)
## # A tibble: 36 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 2014 2014 TRUE 1885 0.730
## 2 2014 2015 FALSE 175 0.0678
## 3 2014 2016 FALSE 142 0.0550
## 4 2014 2017 FALSE 149 0.0577
## 5 2014 2018 FALSE 109 0.0422
## 6 2014 2019 FALSE 121 0.0469
## 7 2015 2014 FALSE 197 0.0757
## 8 2015 2015 TRUE 1711 0.658
## 9 2015 2016 FALSE 162 0.0623
## 10 2015 2017 FALSE 169 0.0650
## # ... with 26 more rows
# Evaluate variable importance
helperPlotVarImp(rf_klas_TDmcwha$rfModel)
# Evaluate error evolution
errorEvolution(rf_klas_TDmcwha, useCategory="OOB", subT="Las Vegas (2014-2019)")
## # A tibble: 200 x 3
## ntree Category Error
## <int> <chr> <dbl>
## 1 1 OOB 0.576
## 2 2 OOB 0.569
## 3 3 OOB 0.565
## 4 4 OOB 0.560
## 5 5 OOB 0.554
## 6 6 OOB 0.548
## 7 7 OOB 0.536
## 8 8 OOB 0.527
## 9 9 OOB 0.518
## 10 10 OOB 0.511
## # ... with 190 more rows
Prediction accuracies are lower for Las Vegas, averaging 60%-75% (2014 appears to be much more differentiated than the other years). This suggests there is less annual variation in the Las Vegas climate than in the Chicago climate, though still enough to meaningfully pull apart the years.
Modified sea-level pressure is the top predictor, and the majority of the OOB error reduction occurs in the first 50 trees.
Suppose that modSLP is eliminated as a predictor variable, and the process is scaled down to 50 trees and repeated for Chicago:
# Run random forest for Chicago 2014-2019 data
rf_kord_TDmcwh_50 <- rfMultiLocale(sub_kord_data,
vrbls=c("TempF", "DewF",
"month", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir"
),
locs=NULL,
locVar="year",
pred="year",
ntree=50,
seed=2007011410,
mtry=4
)
##
## Running for locations:
## [1] 2014 2015 2016 2017 2018 2019
Prediction accuracy and variable importance can then be assessed:
# Evaluate prediction accuracy
evalPredictions(rf_kord_TDmcwh_50,
plotCaption = "Temp, Dew Point, Month, Hour of Day, Cloud Height, Wind",
keyVar="year",
locOrder=TRUE
)
## # A tibble: 36 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 2014 2014 TRUE 1621 0.629
## 2 2014 2015 FALSE 194 0.0753
## 3 2014 2016 FALSE 175 0.0679
## 4 2014 2017 FALSE 217 0.0842
## 5 2014 2018 FALSE 197 0.0764
## 6 2014 2019 FALSE 174 0.0675
## 7 2015 2014 FALSE 240 0.0901
## 8 2015 2015 TRUE 1554 0.583
## 9 2015 2016 FALSE 238 0.0893
## 10 2015 2017 FALSE 196 0.0736
## # ... with 26 more rows
# Evaluate variable importance
helperPlotVarImp(rf_kord_TDmcwh_50$rfModel)
# Compare error evolution through the first 50 trees
tibble::tibble(ntree=1:50,
withSLP=rf_kord_TDmcwha$rfModel$err.rate[1:50, "OOB"],
noSLP=rf_kord_TDmcwh_50$rfModel$err.rate[1:50, "OOB"]
) %>%
pivot_longer(-ntree, names_to="Model", values_to="Error") %>%
ggplot(aes(x=ntree, y=Error, color=Model)) +
geom_line() +
geom_text(data=~filter(., ntree %in% c(1, 50)),
aes(y=Error+0.02, label=paste0(round(100*Error, 1), "%"))
) +
labs(x="# Trees",
y="OOB Error Rate",
title="OOB Error Evolution for SLP Included/Excluded"
) +
ylim(c(0, NA))
Modified sea-level pressure is revealed to be a significant driver of prediction accuracy for Chicago, confirming the earlier variable-importance analysis. This suggests that year-to-year differences in which high- and low-pressure patterns dominate may meaningfully improve the ability to identify the year in Chicago.
The process can also be run for Las Vegas:
# Run random forest for Las Vegas 2014-2019 data
rf_klas_TDmcwh_50 <- rfMultiLocale(sub_klas_data,
vrbls=c("TempF", "DewF",
"month", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir"
),
locs=NULL,
locVar="year",
pred="year",
ntree=50,
seed=2007011420,
mtry=4
)
##
## Running for locations:
## [1] 2014 2015 2016 2017 2018 2019
Prediction accuracy and variable importance can then be assessed:
# Evaluate prediction accuracy
evalPredictions(rf_klas_TDmcwh_50,
plotCaption = "Temp, Dew Point, Month, Hour of Day, Cloud Height, Wind",
keyVar="year",
locOrder=TRUE
)
## # A tibble: 36 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 2014 2014 TRUE 1610 0.630
## 2 2014 2015 FALSE 258 0.101
## 3 2014 2016 FALSE 177 0.0693
## 4 2014 2017 FALSE 194 0.0759
## 5 2014 2018 FALSE 148 0.0579
## 6 2014 2019 FALSE 168 0.0658
## 7 2015 2014 FALSE 298 0.115
## 8 2015 2015 TRUE 1338 0.516
## 9 2015 2016 FALSE 228 0.0880
## 10 2015 2017 FALSE 227 0.0876
## # ... with 26 more rows
# Evaluate variable importance
helperPlotVarImp(rf_klas_TDmcwh_50$rfModel)
# Compare error evolution through the first 50 trees
tibble::tibble(ntree=1:50,
withSLP=rf_klas_TDmcwha$rfModel$err.rate[1:50, "OOB"],
noSLP=rf_klas_TDmcwh_50$rfModel$err.rate[1:50, "OOB"]
) %>%
pivot_longer(-ntree, names_to="Model", values_to="Error") %>%
ggplot(aes(x=ntree, y=Error, color=Model)) +
geom_line() +
geom_text(data=~filter(., ntree %in% c(1, 50)),
aes(y=Error+0.02, label=paste0(round(100*Error, 1), "%"))
) +
labs(x="# Trees",
y="OOB Error Rate",
title="OOB Error Evolution for SLP Included/Excluded"
) +
ylim(c(0, NA))
Modified sea-level pressure is also revealed to be a meaningful driver of prediction accuracy for Las Vegas, though with a smaller effect than seen for Chicago. This potentially suggests less year-to-year variability in SLP for Las Vegas.
Since SLP is so meaningful, are there any patterns by year?
sub_kord_data %>%
bind_rows(sub_klas_data, .id="cityFile") %>%
select(cityFile, year, month, modSLP) %>%
mutate(city=case_when(cityFile==1 ~ "Chicago", cityFile==2 ~ "Las Vegas", TRUE ~ "Error")) %>%
filter(!is.na(modSLP)) %>%
ggplot(aes(x=factor(year), y=modSLP)) +
geom_boxplot(fill="lightblue") +
facet_wrap(~city) +
labs(x="", y="modSLP", title="Modified Sea-Level Pressure by Year (Chicago and Las Vegas)")
Chicago has a higher sea-level pressure than Las Vegas (as expected), though neither city shows much variation on average from year to year. The Chicago data are explored further by month:
sub_kord_data %>%
select(year, month, modSLP) %>%
filter(!is.na(modSLP)) %>%
ggplot(aes(x=month, y=modSLP)) +
geom_boxplot(fill="lightblue") +
facet_wrap(~year) +
labs(x="", y="modSLP", title="Modified Sea-Level Pressure by Month and Year (Chicago)")
sub_kord_data %>%
select(year, month, modSLP) %>%
filter(!is.na(modSLP)) %>%
group_by(year, month) %>%
summarize(medSLP=median(modSLP)) %>%
ggplot(aes(x=month, y=medSLP, color=factor(year), group=year)) +
geom_line(lwd=1) +
labs(x="", y="Median modSLP", title="Median modified Sea-Level Pressure by Month and Year (Chicago)")
sub_kord_data %>%
select(year, month, modSLP) %>%
filter(!is.na(modSLP)) %>%
group_by(year, month) %>%
summarize(medSLP=median(modSLP)) %>%
group_by(month) %>%
summarize(sdSLP=sd(medSLP)) %>%
ggplot(aes(x=month, y=sdSLP)) +
geom_point() +
labs(x="",
y="Standard Deviation of Median modSLP by Year",
title="Variability by year of modified Sea-Level Pressure by Month (Chicago)"
)
There are meaningful differences in median modified sea-level pressure by month and by year. This suggests that several years of data are needed to define an archetype; otherwise there is a risk that the archetype encodes an anomalous “high pressure was in control in June” pattern that is not generally applicable. And since high vs. low pressure is often related to temperature, dew point, cloudiness, wind speed, and the like, the need for multiple years of data to define an archetype likely extends to the other variables as well.
With greater variability in SLP by year in the cold-weather months, predictive ability in the cold-weather months is expected to be higher:
rf_kord_TDmcwha$testData %>%
group_by(month) %>%
summarize(pctError=1-mean(correct)) %>%
ggplot(aes(x=month, y=pctError)) +
geom_col(fill="lightblue") +
geom_text(aes(y=pctError/2, label=paste0(round(100*pctError, 1), "%"))) +
labs(x="", y="Error Rate", title="Prediction Error Rate by Month (Chicago)")
As expected, the model takes advantage of the greater annual variation in modSLP during the cold season to make better predictions of which year it is during those months.
Next an attempt is made to predict the month given the data:
# Run random forest for Las Vegas 2014-2019 data
rf_klas_TDycwha_month_50 <- rfMultiLocale(sub_klas_data,
vrbls=c("TempF", "DewF",
"year", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir",
"modSLP"
),
locs=NULL,
locVar="month",
pred="month",
ntree=50,
seed=2007021356,
mtry=4
)
##
## Running for locations:
## [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
Prediction accuracy and variable importance can then be assessed:
# Evaluate prediction accuracy
evalPredictions(rf_klas_TDycwha_month_50,
plotCaption = "Temp, Dew Point, Year, Hour of Day, Cloud Height, Wind, Altimeter",
keyVar="month",
locOrder=TRUE
)
## # A tibble: 105 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 Jan Jan TRUE 1087 0.809
## 2 Jan Feb FALSE 50 0.0372
## 3 Jan Mar FALSE 27 0.0201
## 4 Jan Apr FALSE 6 0.00447
## 5 Jan May FALSE 5 0.00372
## 6 Jan Nov FALSE 50 0.0372
## 7 Jan Dec FALSE 118 0.0879
## 8 Feb Jan FALSE 75 0.0626
## 9 Feb Feb TRUE 820 0.684
## 10 Feb Mar FALSE 105 0.0876
## # ... with 95 more rows
# Evaluate variable importance
helperPlotVarImp(rf_klas_TDycwha_month_50$rfModel)
# Evaluate error evolution
errorEvolution(rf_klas_TDycwha_month_50, useCategory="OOB", subT="Las Vegas (2014-2019)")
## # A tibble: 50 x 3
## ntree Category Error
## <int> <chr> <dbl>
## 1 1 OOB 0.536
## 2 2 OOB 0.533
## 3 3 OOB 0.528
## 4 4 OOB 0.521
## 5 5 OOB 0.514
## 6 6 OOB 0.501
## 7 7 OOB 0.494
## 8 8 OOB 0.481
## 9 9 OOB 0.472
## 10 10 OOB 0.462
## # ... with 40 more rows
There is meaningful seasonality to the Las Vegas data such that the predicted month is frequently either the same season or (for spring/fall) the mirror season.
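The "same or mirror season" observation can be quantified by collapsing the confusion counts to season level. The sketch below is illustrative, using only the January counts printed above (not the full evalPredictions() output):

```r
# Map months to meteorological seasons and compute the share of January
# observations predicted into the correct season (counts from the table above)
seasonOf <- c(Dec = "Winter", Jan = "Winter", Feb = "Winter",
              Mar = "Spring", Apr = "Spring", May = "Spring",
              Jun = "Summer", Jul = "Summer", Aug = "Summer",
              Sep = "Fall",   Oct = "Fall",   Nov = "Fall")
janRow <- data.frame(predicted = c("Jan", "Feb", "Mar", "Apr", "May", "Nov", "Dec"),
                     n         = c(1087, 50, 27, 6, 5, 50, 118))
sameSeason <- seasonOf[janRow$predicted] == seasonOf["Jan"]
round(sum(janRow$n[sameSeason]) / sum(janRow$n), 3)  # bulk of the mass stays in winter
```

Errors just outside the season (Nov, Mar) account for most of the remainder, consistent with the seasonality interpretation.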
Suppose that year is not available as a predictor:
# Run random forest for Las Vegas 2014-2019 data
rf_klas_TDcwha_month_50 <- rfMultiLocale(sub_klas_data,
vrbls=c("TempF", "DewF",
"hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir",
"modSLP"
),
locs=NULL,
locVar="month",
pred="month",
ntree=50,
seed=2007021402,
mtry=4
)
##
## Running for locations:
## [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
Prediction accuracy and variable importance can then be assessed:
# Evaluate prediction accuracy
evalPredictions(rf_klas_TDcwha_month_50,
plotCaption = "Temp, Dew Point, Hour of Day, Cloud Height, Wind, Altimeter",
keyVar="month",
locOrder=TRUE
)
## # A tibble: 105 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 Jan Jan TRUE 920 0.697
## 2 Jan Feb FALSE 85 0.0644
## 3 Jan Mar FALSE 33 0.025
## 4 Jan Apr FALSE 6 0.00455
## 5 Jan May FALSE 5 0.00379
## 6 Jan Nov FALSE 70 0.0530
## 7 Jan Dec FALSE 201 0.152
## 8 Feb Jan FALSE 112 0.0945
## 9 Feb Feb TRUE 612 0.516
## 10 Feb Mar FALSE 139 0.117
## # ... with 95 more rows
# Evaluate variable importance
helperPlotVarImp(rf_klas_TDcwha_month_50$rfModel)
# Evaluate error evolution
errorEvolution(rf_klas_TDcwha_month_50, useCategory="OOB", subT="Las Vegas (2014-2019)")
## # A tibble: 50 x 3
## ntree Category Error
## <int> <chr> <dbl>
## 1 1 OOB 0.618
## 2 2 OOB 0.616
## 3 3 OOB 0.611
## 4 4 OOB 0.605
## 5 5 OOB 0.599
## 6 6 OOB 0.593
## 7 7 OOB 0.587
## 8 8 OOB 0.580
## 9 9 OOB 0.575
## 10 10 OOB 0.565
## # ... with 40 more rows
The same seasonal pattern is observed, though prediction accuracy falls by ~15%. This suggests it is not uncommon for the Las Vegas climate to run a month ahead of, or a month behind, where it ran in a previous year.
The model appears to still be learning at 50 trees, so it is expanded to 150 trees:
# Run random forest for Las Vegas 2014-2019 data
rf_klas_TDcwha_month_150 <- rfMultiLocale(sub_klas_data,
vrbls=c("TempF", "DewF",
"hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir",
"modSLP"
),
locs=NULL,
locVar="month",
pred="month",
ntree=150,
seed=2007021402,
mtry=4
)
##
## Running for locations:
## [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
Prediction accuracy and variable importance can then be assessed:
# Evaluate prediction accuracy
evalPredictions(rf_klas_TDcwha_month_150,
plotCaption = "Temp, Dew Point, Hour of Day, Cloud Height, Wind, Altimeter",
keyVar="month",
locOrder=TRUE
)
## # A tibble: 105 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 Jan Jan TRUE 927 0.702
## 2 Jan Feb FALSE 75 0.0568
## 3 Jan Mar FALSE 38 0.0288
## 4 Jan Apr FALSE 6 0.00455
## 5 Jan May FALSE 5 0.00379
## 6 Jan Nov FALSE 69 0.0523
## 7 Jan Dec FALSE 200 0.152
## 8 Feb Jan FALSE 91 0.0768
## 9 Feb Feb TRUE 623 0.526
## 10 Feb Mar FALSE 131 0.111
## # ... with 95 more rows
# Evaluate variable importance
helperPlotVarImp(rf_klas_TDcwha_month_150$rfModel)
# Evaluate error evolution
errorEvolution(rf_klas_TDcwha_month_150, useCategory="OOB", subT="Las Vegas (2014-2019)")
## # A tibble: 150 x 3
## ntree Category Error
## <int> <chr> <dbl>
## 1 1 OOB 0.618
## 2 2 OOB 0.616
## 3 3 OOB 0.611
## 4 4 OOB 0.605
## 5 5 OOB 0.599
## 6 6 OOB 0.593
## 7 7 OOB 0.587
## 8 8 OOB 0.580
## 9 9 OOB 0.575
## 10 10 OOB 0.565
## # ... with 140 more rows
Another topic of interest is the use of sea-level pressure vs. altimeter in modeling. Barometric pressure in a METAR is captured in two ways: the altimeter setting and the sea-level pressure (SLP).
In theory, both metrics report almost the same thing, just in different units. The small difference is that SLP makes an adjustment for the average 12-hour temperature (since temperature is a driver of air pressure), while altimeter is purely an adjustment for altitude.
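As a quick illustration of the units (the values below are illustrative, not taken from the processed data): the altimeter setting is reported in inches of mercury while SLP is reported in hectopascals, so a simple unit conversion puts the two on roughly the same scale.

```r
# Illustrative only: convert a standard-atmosphere altimeter setting (inHg)
# to hectopascals; 1 inHg = 33.8639 hPa
altimeter_inHg <- 29.92
approx_hPa <- altimeter_inHg * 33.8639
round(approx_hPa, 1)  # close to 1013.2 hPa, standard sea-level pressure
```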
How does the model perform with neither/either/both of Altimeter and modSLP included? As an example, an attempt is made to classify a handful of cities using 2016 data:
sub_mini_2016_data <- metarData %>%
mutate(hr=lubridate::hour(dtime)) %>%
filter(year==2016,
str_sub(source, 1, 4) %in% c("kord", "kmke", "kmsy", "klas", "ksan", "kden", "kbos", "ksea")
)
First, a model is run with neither Altimeter nor SLP:
# Run random forest for subset data (2016)
rf_mini_TDmcwh <- rfMultiLocale(sub_mini_2016_data,
vrbls=c("TempF", "DewF",
"month", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir"
),
locs=NULL,
locVar="locale",
pred="locale",
ntree=100,
seed=2007021422,
mtry=4
)
##
## Running for locations:
## [1] "Boston, MA (2016)" "Chicago, IL (2016)" "Denver, CO (2016)"
## [4] "Las Vegas, NV (2016)" "Milwaukee, WI (2016)" "New Orleans, LA (2016)"
## [7] "San Diego, CA (2016)" "Seattle, WA (2016)"
Prediction accuracy and variable importance can then be assessed:
# Evaluate prediction accuracy
evalPredictions(rf_mini_TDmcwh,
plotCaption = "Temp, Dew Point, Month, Hour of Day, Cloud Height, Wind",
keyVar="locale",
locOrder=NULL
)
## # A tibble: 64 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 Boston, MA (2016) Boston, MA (2016) TRUE 1800 0.695
## 2 Boston, MA (2016) Chicago, IL (2016) FALSE 186 0.0718
## 3 Boston, MA (2016) Denver, CO (2016) FALSE 158 0.0610
## 4 Boston, MA (2016) Las Vegas, NV (2016) FALSE 25 0.00965
## 5 Boston, MA (2016) Milwaukee, WI (2016) FALSE 204 0.0788
## 6 Boston, MA (2016) New Orleans, LA (2016) FALSE 39 0.0151
## 7 Boston, MA (2016) San Diego, CA (2016) FALSE 41 0.0158
## 8 Boston, MA (2016) Seattle, WA (2016) FALSE 137 0.0529
## 9 Chicago, IL (2016) Boston, MA (2016) FALSE 167 0.0641
## 10 Chicago, IL (2016) Chicago, IL (2016) TRUE 1538 0.590
## # ... with 54 more rows
# Evaluate variable importance
helperPlotVarImp(rf_mini_TDmcwh$rfModel)
# Evaluate error evolution
errorEvolution(rf_mini_TDmcwh, useCategory="OOB", subT="No Altimeter, No SLP")
## # A tibble: 100 x 3
## ntree Category Error
## <int> <chr> <dbl>
## 1 1 OOB 0.374
## 2 2 OOB 0.373
## 3 3 OOB 0.369
## 4 4 OOB 0.363
## 5 5 OOB 0.355
## 6 6 OOB 0.345
## 7 7 OOB 0.335
## 8 8 OOB 0.328
## 9 9 OOB 0.317
## 10 10 OOB 0.309
## # ... with 90 more rows
Overall accuracy is roughly 75%. The greatest challenge is separating Chicago and Milwaukee, with Boston also showing some similarity to those cities.
The process is then run with both Altimeter and SLP:
# Run random forest for subset data (2016)
rf_mini_TDmcwhas <- rfMultiLocale(sub_mini_2016_data,
vrbls=c("TempF", "DewF",
"month", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir",
"modSLP", "Altimeter"
),
locs=NULL,
locVar="locale",
pred="locale",
ntree=100,
seed=2007021429,
mtry=4
)
##
## Running for locations:
## [1] "Boston, MA (2016)" "Chicago, IL (2016)" "Denver, CO (2016)"
## [4] "Las Vegas, NV (2016)" "Milwaukee, WI (2016)" "New Orleans, LA (2016)"
## [7] "San Diego, CA (2016)" "Seattle, WA (2016)"
Prediction accuracy and variable importance can again be assessed:
# Evaluate prediction accuracy
evalPredictions(rf_mini_TDmcwhas,
plotCaption = "Temp, Dew Point, Month, Hour of Day, Cloud Height, Wind, Altimeter, and SLP",
keyVar="locale",
locOrder=NULL
)
## Warning: Removed 1 rows containing missing values (geom_text).
## # A tibble: 62 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 Boston, MA (2016) Boston, MA (2016) TRUE 2131 0.837
## 2 Boston, MA (2016) Chicago, IL (2016) FALSE 90 0.0354
## 3 Boston, MA (2016) Denver, CO (2016) FALSE 44 0.0173
## 4 Boston, MA (2016) Las Vegas, NV (2016) FALSE 25 0.00982
## 5 Boston, MA (2016) Milwaukee, WI (2016) FALSE 109 0.0428
## 6 Boston, MA (2016) New Orleans, LA (2016) FALSE 18 0.00707
## 7 Boston, MA (2016) San Diego, CA (2016) FALSE 30 0.0118
## 8 Boston, MA (2016) Seattle, WA (2016) FALSE 98 0.0385
## 9 Chicago, IL (2016) Boston, MA (2016) FALSE 112 0.0412
## 10 Chicago, IL (2016) Chicago, IL (2016) TRUE 1852 0.681
## # ... with 52 more rows
# Evaluate variable importance
helperPlotVarImp(rf_mini_TDmcwhas$rfModel)
# Evaluate error evolution
errorEvolution(rf_mini_TDmcwhas, useCategory="OOB", subT="Both Altimeter and SLP")
## # A tibble: 100 x 3
## ntree Category Error
## <int> <chr> <dbl>
## 1 1 OOB 0.316
## 2 2 OOB 0.319
## 3 3 OOB 0.319
## 4 4 OOB 0.310
## 5 5 OOB 0.299
## 6 6 OOB 0.291
## 7 7 OOB 0.282
## 8 8 OOB 0.271
## 9 9 OOB 0.261
## 10 10 OOB 0.251
## # ... with 90 more rows
Overall accuracy increases to about 85%. Except for Chicago/Milwaukee, every city is overwhelmingly classified as itself.
Smaller runs with just SLP and just Altimeter are conducted to verify that the learning rate is roughly the same in those cases. A run with SLP plus a dummy variable (SLP + rnorm()) is also conducted:
# Run random forest for subset data (2016)
rf_mini_TDmcwhs <- rfMultiLocale(sub_mini_2016_data,
vrbls=c("TempF", "DewF",
"month", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir",
"modSLP"
),
locs=NULL,
locVar="locale",
pred="locale",
ntree=50,
seed=2007021435,
mtry=4
)
##
## Running for locations:
## [1] "Boston, MA (2016)" "Chicago, IL (2016)" "Denver, CO (2016)"
## [4] "Las Vegas, NV (2016)" "Milwaukee, WI (2016)" "New Orleans, LA (2016)"
## [7] "San Diego, CA (2016)" "Seattle, WA (2016)"
# Run random forest for subset data (2016)
rf_mini_TDmcwha <- rfMultiLocale(sub_mini_2016_data,
vrbls=c("TempF", "DewF",
"month", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir",
"Altimeter"
),
locs=NULL,
locVar="locale",
pred="locale",
ntree=50,
seed=2007021440,
mtry=4
)
##
## Running for locations:
## [1] "Boston, MA (2016)" "Chicago, IL (2016)" "Denver, CO (2016)"
## [4] "Las Vegas, NV (2016)" "Milwaukee, WI (2016)" "New Orleans, LA (2016)"
## [7] "San Diego, CA (2016)" "Seattle, WA (2016)"
set.seed(2007021444)
subUseData <- sub_mini_2016_data %>%
mutate(dummy=modSLP + rnorm(n=nrow(sub_mini_2016_data)))
# Run random forest for subset data (2016)
rf_mini_TDmcwhsd <- rfMultiLocale(subUseData,
vrbls=c("TempF", "DewF",
"month", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir",
"modSLP", "dummy"
),
locs=NULL,
locVar="locale",
pred="locale",
ntree=50,
seed=2007021445,
mtry=4
)
##
## Running for locations:
## [1] "Boston, MA (2016)" "Chicago, IL (2016)" "Denver, CO (2016)"
## [4] "Las Vegas, NV (2016)" "Milwaukee, WI (2016)" "New Orleans, LA (2016)"
## [7] "San Diego, CA (2016)" "Seattle, WA (2016)"
OOB error evolution through the first 50 trees can be compared:
# Compare error evolution through the first 50 trees
tibble::tibble(ntree=1:50,
neither=rf_mini_TDmcwh$rfModel$err.rate[1:50, "OOB"],
altimeter=rf_mini_TDmcwha$rfModel$err.rate[1:50, "OOB"],
modslp=rf_mini_TDmcwhs$rfModel$err.rate[1:50, "OOB"],
slp_dummy=rf_mini_TDmcwhsd$rfModel$err.rate[1:50, "OOB"],
both=rf_mini_TDmcwhas$rfModel$err.rate[1:50, "OOB"]
) %>%
pivot_longer(-ntree, names_to="Model", values_to="Error") %>%
ggplot(aes(x=ntree, y=Error, color=Model)) +
geom_line() +
geom_text(data=~filter(., ntree %in% c(1, 50)),
aes(y=Error+0.02, label=paste0(round(100*Error, 1), "%"))
) +
labs(x="# Trees",
y="OOB Error Rate",
title="OOB Error Evolution for SLP Included/Excluded"
) +
ylim(c(0, NA))
Interestingly, the inclusion of both Altimeter and modSLP appears to drive a 2%-3% decrease in classification error. The main difference between the variables is that modSLP contains a hint about the previous 12-hour temperature.
Importantly, including a dummy variable alongside modSLP has either no impact or a slight negative impact on model performance. So, the gain from adding Altimeter to modSLP is likely a real effect, not merely an artifact of including two highly correlated variables.
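The dummy construction can be sanity-checked in a toy setting (the values below are simulated stand-ins, not the actual modSLP data): unit-variance noise is small relative to typical SLP variability, so the dummy is nearly a duplicate of modSLP.

```r
# Illustrative only: a dummy built as modSLP + rnorm() is almost perfectly
# correlated with modSLP, so it carries essentially no new information
set.seed(1)
slp   <- rnorm(5000, mean = 1013, sd = 8)  # stand-in for modSLP values (hPa)
dummy <- slp + rnorm(5000)                 # same construction as the dummy above
round(cor(slp, dummy), 3)                  # very close to 1
```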
Next, an attempt is made to compare every grouping of two cities, using all variables, mtry of 4, and a very small forest of 15 trees:
# Create a container list to hold the output
list_varimp_2016 <- vector("list", 0.5*length(names_2016)*(length(names_2016)-1))
# Set a random seed
set.seed(2007031342)
# Loop through all possible combinations
n <- 1
for (ctr in 1:(length(names_2016)-1)) {
for (ctr2 in (ctr+1):length(names_2016)) {
list_varimp_2016[[n]] <- rfTwoLocales(mutate(metarData, hr=lubridate::hour(dtime)),
loc1=names_2016[ctr],
loc2=names_2016[ctr2],
vrbls=c("TempF", "DewF",
"month", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir",
"modSLP", "Altimeter"
),
ntree=15,
mtry=4
)
n <- n + 1
if ((n %% 40) == 0) { cat("Through number:", n, "\n")}
}
}
## Through number: 40
## Through number: 80
## Through number: 120
## Through number: 160
## Through number: 200
## Through number: 240
## Through number: 280
## Through number: 320
## Through number: 360
## Through number: 400
# Create a tibble from the underlying accuracy data
acc_varimp_2016 <- map_dfr(list_varimp_2016, .f=helperAccuracyLocale)
# Assess the top 20 classification accuracies
acc_varimp_2016 %>%
arrange(-accOverall) %>%
head(20)
## # A tibble: 20 x 5
## locale1 locale2 accOverall accLocale1 accLocale2
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Denver, CO (2016) Miami, FL (2016) 0.998 0.998 0.998
## 2 Denver, CO (2016) Tampa Bay, FL (2016) 0.996 0.998 0.995
## 3 Las Vegas, NV (2016) Miami, FL (2016) 0.995 0.995 0.995
## 4 Denver, CO (2016) New Orleans, LA (201~ 0.995 0.996 0.994
## 5 Denver, CO (2016) San Diego, CA (2016) 0.994 0.994 0.994
## 6 Denver, CO (2016) Houston, TX (2016) 0.993 0.994 0.992
## 7 Denver, CO (2016) San Francisco, CA (2~ 0.993 0.994 0.992
## 8 Miami, FL (2016) Seattle, WA (2016) 0.992 0.990 0.994
## 9 Miami, FL (2016) Phoenix, AZ (2016) 0.992 0.991 0.992
## 10 Miami, FL (2016) Traverse City, MI (2~ 0.991 0.993 0.990
## 11 Miami, FL (2016) Minneapolis, MN (201~ 0.991 0.991 0.990
## 12 Boston, MA (2016) Miami, FL (2016) 0.991 0.990 0.991
## 13 Denver, CO (2016) Los Angeles, CA (201~ 0.991 0.993 0.989
## 14 Denver, CO (2016) San Jose, CA (2016) 0.990 0.991 0.989
## 15 Denver, CO (2016) San Antonio, TX (201~ 0.990 0.992 0.988
## 16 Grand Rapids, MI (201~ Miami, FL (2016) 0.990 0.990 0.990
## 17 Green Bay, WI (2016) Miami, FL (2016) 0.990 0.989 0.990
## 18 Miami, FL (2016) Milwaukee, WI (2016) 0.989 0.989 0.989
## 19 Madison, WI (2016) Miami, FL (2016) 0.989 0.986 0.992
## 20 Las Vegas, NV (2016) Tampa Bay, FL (2016) 0.989 0.990 0.987
# Assess the bottom 20 classification accuracies
acc_varimp_2016 %>%
arrange(accOverall) %>%
head(20)
## # A tibble: 20 x 5
## locale1 locale2 accOverall accLocale1 accLocale2
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Chicago, IL (2016) Milwaukee, WI (2016) 0.675 0.685 0.666
## 2 Newark, NJ (2016) Philadelphia, PA (20~ 0.676 0.687 0.665
## 3 Detroit, MI (2016) Grand Rapids, MI (20~ 0.712 0.718 0.706
## 4 Madison, WI (2016) Milwaukee, WI (2016) 0.718 0.725 0.711
## 5 Philadelphia, PA (201~ Washington, DC (2016) 0.730 0.741 0.718
## 6 Chicago, IL (2016) Grand Rapids, MI (20~ 0.731 0.750 0.711
## 7 Green Bay, WI (2016) Madison, WI (2016) 0.737 0.747 0.728
## 8 Chicago, IL (2016) Detroit, MI (2016) 0.740 0.746 0.735
## 9 Chicago, IL (2016) Madison, WI (2016) 0.745 0.756 0.733
## 10 Grand Rapids, MI (201~ Milwaukee, WI (2016) 0.746 0.747 0.744
## 11 Green Bay, WI (2016) Milwaukee, WI (2016) 0.754 0.757 0.751
## 12 Madison, WI (2016) Minneapolis, MN (201~ 0.758 0.759 0.757
## 13 Grand Rapids, MI (201~ Madison, WI (2016) 0.759 0.763 0.755
## 14 Detroit, MI (2016) Milwaukee, WI (2016) 0.765 0.777 0.753
## 15 Grand Rapids, MI (201~ Traverse City, MI (2~ 0.769 0.770 0.768
## 16 Detroit, MI (2016) Indianapolis, IN (20~ 0.772 0.775 0.770
## 17 Chicago, IL (2016) Indianapolis, IN (20~ 0.775 0.777 0.774
## 18 Boston, MA (2016) Newark, NJ (2016) 0.778 0.781 0.776
## 19 Indianapolis, IN (201~ Saint Louis, MO (201~ 0.779 0.791 0.768
## 20 Madison, WI (2016) Traverse City, MI (2~ 0.783 0.782 0.784
As previously, the best accuracies are obtained when comparing cities in very different climates (e.g., Denver vs. humid/marine cities or Miami vs. desert/cold cities), while the worst accuracies are obtained when comparing very similar cities (e.g., Chicago and Milwaukee, or Newark and Philadelphia).
Variable importance can then be assessed across all 1:1 classifications:
# Create a tibble of all the variable importance data
val_varimp_2016 <- map_dfr(list_varimp_2016,
.f=function(x) { x$rfModel %>%
caret::varImp() %>%
t() %>%
as.data.frame()
}
) %>%
tibble::as_tibble()
# Create boxplot of overall variable importance
val_varimp_2016 %>%
mutate(num=1:nrow(val_varimp_2016)) %>%
pivot_longer(-num, names_to="variable", values_to="varImp") %>%
ggplot(aes(x=fct_reorder(variable, varImp), y=varImp)) +
geom_boxplot(fill="lightblue") +
labs(x="",
y="Variable Importance",
title="Variable Importance for 1:1 Random Forest Classifications"
)
# Attach the city names and OOB error rate
tbl_varimp_2016 <- sapply(list_varimp_2016,
FUN=function(x) { c(names(x$errorRate[2:3]), x$errorRate["OOB"]) }
) %>%
t() %>%
as.data.frame() %>%
bind_cols(val_varimp_2016) %>%
tibble::as_tibble() %>%
mutate(OOB=as.numeric(as.character(OOB))) %>%
rename(locale1=V1,
locale2=V2
)
# Plot accuracy vs. spikiness of variable importance
tbl_varimp_2016 %>%
pivot_longer(-c(locale1, locale2, OOB), names_to="var", values_to="varImp") %>%
group_by(locale1, locale2, OOB) %>%
summarize(mean=mean(varImp), max=max(varImp)) %>%
mutate(maxMean=max/mean) %>%
ggplot(aes(x=maxMean, y=1-OOB)) +
geom_point() +
geom_smooth(method="loess") +
labs(x="Ratio of Maximum Variable Importance to Mean Variable Importance",
y="OOB Accuracy",
title="Accuracy vs. Spikiness of Variable Importance"
)
Broadly speaking, the same variables that drive overall classification are important in driving 1:1 classifications. There is meaningful spikiness, suggesting that different 1:1 classifications rely on different variables.
There is a strong trend in which the best accuracies are obtained when a single spiky dimension drives the classification. This suggests that while the model can take advantage of all 10 variables, it has the easiest time when there is a single, well-differentiated variable. No surprise.
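The max/mean "spikiness" metric used above can be illustrated with made-up importance values:

```r
# Toy illustration of the spikiness metric (max/mean variable importance);
# the importance values are invented for illustration
imp_spiky <- c(TempF = 95, DewF = 20, modSLP = 15, WindSpeed = 10)
imp_flat  <- c(TempF = 40, DewF = 35, modSLP = 32, WindSpeed = 30)
max(imp_spiky) / mean(imp_spiky)  # well above 1: one dominant variable
max(imp_flat) / mean(imp_flat)    # near 1: balanced importance
```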
An attempt is made to define archetype cities as cities that are either 1) very different from other cities, or 2) very similar to other cities. An archetype should always meet criterion 1 for some meaningful number of comparisons, while criterion 2 allows an archetype to be similar to several other cities of the same archetype:
oobData <- tibble::tibble(City=character(0), OOBLow=numeric(0), OOBHigh=numeric(0))
for (city in cityNameMapper[names_2016]) {
cityData <- tbl_varimp_2016 %>%
filter(locale1 == city | locale2 == city)
lowOOB <- cityData %>%
top_n(n=10, wt=-OOB) %>%
pull(OOB) %>%
mean()
highOOB <- cityData %>%
top_n(n=5, wt=OOB) %>%
pull(OOB) %>%
mean()
cat("\nCity:", city,
"\tOOB (Low 10 Average):", round(lowOOB, 4),
"\tOOB (High 5 Average):", round(highOOB, 4)
)
oobData <- bind_rows(oobData, tibble::tibble(City=city, OOBLow=lowOOB, OOBHigh=highOOB))
}
##
## City: Atlanta, GA (2016) OOB (Low 10 Average): 0.0435 OOB (High 5 Average): 0.1349
## City: Boston, MA (2016) OOB (Low 10 Average): 0.0257 OOB (High 5 Average): 0.1907
## City: Washington, DC (2016) OOB (Low 10 Average): 0.0398 OOB (High 5 Average): 0.1903
## City: Denver, CO (2016) OOB (Low 10 Average): 0.0076 OOB (High 5 Average): 0.0609
## City: Dallas, TX (2016) OOB (Low 10 Average): 0.0409 OOB (High 5 Average): 0.1416
## City: Detroit, MI (2016) OOB (Low 10 Average): 0.0273 OOB (High 5 Average): 0.2446
## City: Newark, NJ (2016) OOB (Low 10 Average): 0.0366 OOB (High 5 Average): 0.2226
## City: Green Bay, WI (2016) OOB (Low 10 Average): 0.0202 OOB (High 5 Average): 0.2269
## City: Grand Rapids, MI (2016) OOB (Low 10 Average): 0.0238 OOB (High 5 Average): 0.2567
## City: Houston, TX (2016) OOB (Low 10 Average): 0.0232 OOB (High 5 Average): 0.1403
## City: Indianapolis, IN (2016) OOB (Low 10 Average): 0.0356 OOB (High 5 Average): 0.2127
## City: Las Vegas, NV (2016) OOB (Low 10 Average): 0.0162 OOB (High 5 Average): 0.0595
## City: Los Angeles, CA (2016) OOB (Low 10 Average): 0.0295 OOB (High 5 Average): 0.0943
## City: Lincoln, NE (2016) OOB (Low 10 Average): 0.0303 OOB (High 5 Average): 0.1682
## City: Miami, FL (2016) OOB (Low 10 Average): 0.0082 OOB (High 5 Average): 0.1027
## City: Milwaukee, WI (2016) OOB (Low 10 Average): 0.0228 OOB (High 5 Average): 0.2685
## City: Madison, WI (2016) OOB (Low 10 Average): 0.0243 OOB (High 5 Average): 0.2567
## City: Minneapolis, MN (2016) OOB (Low 10 Average): 0.022 OOB (High 5 Average): 0.205
## City: New Orleans, LA (2016) OOB (Low 10 Average): 0.0164 OOB (High 5 Average): 0.1368
## City: Chicago, IL (2016) OOB (Low 10 Average): 0.0275 OOB (High 5 Average): 0.2667
## City: Philadelphia, PA (2016) OOB (Low 10 Average): 0.0398 OOB (High 5 Average): 0.2211
## City: Phoenix, AZ (2016) OOB (Low 10 Average): 0.0158 OOB (High 5 Average): 0.0647
## City: San Diego, CA (2016) OOB (Low 10 Average): 0.0226 OOB (High 5 Average): 0.0843
## City: San Antonio, TX (2016) OOB (Low 10 Average): 0.0265 OOB (High 5 Average): 0.1022
## City: Seattle, WA (2016) OOB (Low 10 Average): 0.021 OOB (High 5 Average): 0.0756
## City: San Francisco, CA (2016) OOB (Low 10 Average): 0.0244 OOB (High 5 Average): 0.0895
## City: San Jose, CA (2016) OOB (Low 10 Average): 0.0334 OOB (High 5 Average): 0.1003
## City: Saint Louis, MO (2016) OOB (Low 10 Average): 0.0396 OOB (High 5 Average): 0.1737
## City: Tampa Bay, FL (2016) OOB (Low 10 Average): 0.0136 OOB (High 5 Average): 0.1332
## City: Traverse City, MI (2016) OOB (Low 10 Average): 0.0221 OOB (High 5 Average): 0.2158
# Plot OOB data summary
oobData %>%
ggplot(aes(x=OOBLow, y=OOBHigh)) +
geom_text(aes(label=City)) +
labs(x="Average OOB Error of Best 10 1:1 Predictions",
y="Average OOB Error of Worst 5 1:1 Predictions"
)
At a glance, Miami, Denver, and Phoenix/Las Vegas make for great archetypes in these data. They are broadly dissimilar from a large number of cities and not all that similar to many others. New Orleans or Tampa may also make for good archetypes (somewhat surprisingly, in that these should be similar to Miami).
Suppose that hierarchical clustering is attempted, using 1-OOB as the main ‘distance’ variable:
# Create both halves of what will become the distance matrix
upperHalf <- tbl_varimp_2016 %>%
mutate(locale1=as.character(locale1), locale2=as.character(locale2))
lowerHalf <- upperHalf %>%
rename(locale2=locale1, locale1=locale2) %>%
select(locale1, locale2, everything())
bothHalf <- upperHalf %>%
bind_rows(lowerHalf)
# Select only locale1, locale2, OOB and create a record for each locale to itself
mtxData <- bothHalf %>%
select(locale1, locale2, OOB) %>%
bind_rows(tibble::tibble(locale1=unique(bothHalf$locale1), locale2=locale1, OOB=NA)) %>%
arrange(locale1, locale2) %>%
mutate(accuracy=1-OOB)
# Create the matrix and add row/column names
mtxOOB <- matrix(data=mtxData$accuracy, nrow=sqrt(nrow(mtxData)), ncol=sqrt(nrow(mtxData)), byrow=TRUE)
dimnames(mtxOOB) <- list(unique(mtxData$locale1), unique(mtxData$locale1))
# Convert to distance matrix
distOOB <- as.dist(mtxOOB)
# Run hierarchical clustering
distOOB %>%
hclust(method="complete") %>%
plot()
Seattle and Denver appear to be good stand-alone cities. They are not particularly close to any other cities.
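The dendrogram can also be cut into explicit groups with cutree(). The sketch below uses a made-up 4-city distance matrix rather than the actual distOOB object:

```r
# Toy distance matrix: Chicago/Milwaukee are close, Miami/Tampa are close,
# and the two pairs are far apart
m <- matrix(c(0.00, 0.10, 0.90, 0.95,
              0.10, 0.00, 0.92, 0.90,
              0.90, 0.92, 0.00, 0.15,
              0.95, 0.90, 0.15, 0.00),
            nrow = 4,
            dimnames = list(c("Chicago", "Milwaukee", "Miami", "Tampa"),
                            c("Chicago", "Milwaukee", "Miami", "Tampa")))
hc <- hclust(as.dist(m), method = "complete")
cutree(hc, k = 2)  # Chicago/Milwaukee share one cluster; Miami/Tampa the other
```

Applied to distOOB, the same approach would give explicit candidate groupings rather than a visual read of the dendrogram.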
Suppose that an attempt is made to classify cities as Seattle, Denver, or Other:
# Create data for Denver/Seattle/Other
sub_densea_data <- metarData %>%
mutate(place=case_when(source=="kden_2016" ~ "Denver",
source=="ksea_2016" ~ "Seattle",
TRUE ~ "Other"
)
) %>%
filter(year==2016)
# Down-sample all to smallest 'place'
nSmallPlace <- sub_densea_data %>%
count(place) %>%
pull(n) %>%
min()
# Group the data and sample to only this minimum size
set.seed(2007031504)
sub_densea_data <- sub_densea_data %>%
group_by(place) %>%
sample_n(size=nSmallPlace) %>%
ungroup() %>%
mutate(hr=lubridate::hour(dtime))
# Run random forest for subset data
rf_densea_TDmcwhsa <- rfMultiLocale(sub_densea_data,
vrbls=c("TempF", "DewF",
"month", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir",
"modSLP", "Altimeter"
),
locs=NULL,
locVar="place",
pred="place",
ntree=100,
seed=2007031510,
mtry=4
)
##
## Running for locations:
## [1] "Denver" "Other" "Seattle"
# Evaluate prediction accuracy
evalPredictions(rf_densea_TDmcwhsa,
plotCaption = "Temp, Dew Point, Month, Hour of Day, Cloud Height, Wind, Altimeter, and SLP",
keyVar="place",
locOrder=NULL
)
## Warning: Removed 1 rows containing missing values (geom_text).
## # A tibble: 9 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 Denver Denver TRUE 2522 0.960
## 2 Denver Other FALSE 83 0.0316
## 3 Denver Seattle FALSE 21 0.00800
## 4 Other Denver FALSE 180 0.0685
## 5 Other Other TRUE 2182 0.831
## 6 Other Seattle FALSE 265 0.101
## 7 Seattle Denver FALSE 22 0.00843
## 8 Seattle Other FALSE 115 0.0441
## 9 Seattle Seattle TRUE 2472 0.947
# Evaluate variable importance
helperPlotVarImp(rf_densea_TDmcwhsa$rfModel)
# Evaluate error evolution
errorEvolution(rf_densea_TDmcwhsa, subT="Denver, Seattle, Other")
## # A tibble: 400 x 3
## ntree Category Error
## <int> <chr> <dbl>
## 1 1 OOB 0.180
## 2 1 Denver 0.115
## 3 1 Other 0.264
## 4 1 Seattle 0.162
## 5 2 OOB 0.191
## 6 2 Denver 0.135
## 7 2 Other 0.268
## 8 2 Seattle 0.169
## 9 3 OOB 0.189
## 10 3 Denver 0.122
## # ... with 390 more rows
The model is very good at picking out Seattle and Denver, but not as good at recognizing that Other is neither Seattle nor Denver. A different approach may be needed to find stand-alone cities and archetypes.
An initial attempt is made to classify all of the 2016 cities, using a small forest of 25 trees, with the objective of finding cities that are frequently classified as each other:
# Run random forest for subset data
rf_all_nt25_TDmcwhsa <- rfMultiLocale(filter(mutate(metarData, hr=lubridate::hour(dtime)), year==2016),
vrbls=c("TempF", "DewF",
"month", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir",
"modSLP", "Altimeter"
),
locs=NULL,
locVar="locale",
pred="locale",
ntree=25,
seed=2007041352,
mtry=4
)
##
## Running for locations:
## [1] "Atlanta, GA (2016)" "Boston, MA (2016)"
## [3] "Chicago, IL (2016)" "Dallas, TX (2016)"
## [5] "Denver, CO (2016)" "Detroit, MI (2016)"
## [7] "Grand Rapids, MI (2016)" "Green Bay, WI (2016)"
## [9] "Houston, TX (2016)" "Indianapolis, IN (2016)"
## [11] "Las Vegas, NV (2016)" "Lincoln, NE (2016)"
## [13] "Los Angeles, CA (2016)" "Madison, WI (2016)"
## [15] "Miami, FL (2016)" "Milwaukee, WI (2016)"
## [17] "Minneapolis, MN (2016)" "New Orleans, LA (2016)"
## [19] "Newark, NJ (2016)" "Philadelphia, PA (2016)"
## [21] "Phoenix, AZ (2016)" "Saint Louis, MO (2016)"
## [23] "San Antonio, TX (2016)" "San Diego, CA (2016)"
## [25] "San Francisco, CA (2016)" "San Jose, CA (2016)"
## [27] "Seattle, WA (2016)" "Tampa Bay, FL (2016)"
## [29] "Traverse City, MI (2016)" "Washington, DC (2016)"
Accuracy can then be assessed:
# Evaluate prediction accuracy
evalPredictions(rf_all_nt25_TDmcwhsa,
plotCaption = "Temp, Dew Point, Month, Hour of Day, Cloud Height, Wind, Altimeter, and SLP",
keyVar="locale",
locOrder=NULL
)
## # A tibble: 863 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 Atlanta, GA (2016) Atlanta, GA (2016) TRUE 1776 0.663
## 2 Atlanta, GA (2016) Boston, MA (2016) FALSE 12 0.00448
## 3 Atlanta, GA (2016) Chicago, IL (2016) FALSE 12 0.00448
## 4 Atlanta, GA (2016) Dallas, TX (2016) FALSE 91 0.0340
## 5 Atlanta, GA (2016) Denver, CO (2016) FALSE 10 0.00373
## 6 Atlanta, GA (2016) Detroit, MI (2016) FALSE 20 0.00747
## 7 Atlanta, GA (2016) Grand Rapids, MI (2016) FALSE 9 0.00336
## 8 Atlanta, GA (2016) Green Bay, WI (2016) FALSE 8 0.00299
## 9 Atlanta, GA (2016) Houston, TX (2016) FALSE 43 0.0161
## 10 Atlanta, GA (2016) Indianapolis, IN (2016) FALSE 44 0.0164
## # ... with 853 more rows
# Evaluate variable importance
helperPlotVarImp(rf_all_nt25_TDmcwhsa$rfModel)
# Evaluate error evolution
errorEvolution(rf_all_nt25_TDmcwhsa, useCategory="OOB", subT="All 30 cities individually")
## # A tibble: 25 x 3
## ntree Category Error
## <int> <chr> <dbl>
## 1 1 OOB 0.604
## 2 2 OOB 0.606
## 3 3 OOB 0.603
## 4 4 OOB 0.597
## 5 5 OOB 0.589
## 6 6 OOB 0.580
## 7 7 OOB 0.568
## 8 8 OOB 0.555
## 9 9 OOB 0.545
## 10 10 OOB 0.535
## # ... with 15 more rows
There are frequently misclassified cities that might benefit from being grouped and classified together.
Encouragingly, this is broadly the same as the findings from the hierarchical clustering. The model is still learning at a decent rate, so the 25-tree forest is, as expected, too small.
A mapping file is created to map relevant cities together:
# Create the locale mapper
locMapperTibble_ss01 <- tibble::tribble(
~source, ~locType,
'katl_2016', 'Same',
'kbos_2016', 'Same',
'kdca_2016', 'DC-PHL-EWR (2016)',
'kden_2016', 'Same',
'kdfw_2016', 'Same',
'kdtw_2016', 'Detroit-Grand Rapids (2016)',
'kewr_2016', 'DC-PHL-EWR (2016)',
'kgrb_2016', 'Same',
'kgrr_2016', 'Detroit-Grand Rapids (2016)',
'kiah_2016', 'Houston-New Orleans (2016)',
'kind_2016', 'Same',
'klas_2016', 'Las Vegas-Phoenix (2016)',
'klax_2016', 'Los Angeles-San Diego (2016)',
'klnk_2016', 'Same',
'kmia_2016', 'Miami-Tampa Bay (2016)',
'kmke_2016', 'Chicago-Milwaukee (2016)',
'kmsn_2016', 'Same',
'kmsp_2016', 'Same',
'kmsy_2016', 'Houston-New Orleans (2016)',
'kord_2016', 'Chicago-Milwaukee (2016)',
'kphl_2016', 'DC-PHL-EWR (2016)',
'kphx_2016', 'Las Vegas-Phoenix (2016)',
'ksan_2016', 'Los Angeles-San Diego (2016)',
'ksat_2016', 'Same',
'ksea_2016', 'Same',
'ksfo_2016', 'Bay Area (2016)',
'ksjc_2016', 'Bay Area (2016)',
'kstl_2016', 'Same',
'ktpa_2016', 'Miami-Tampa Bay (2016)',
'ktvc_2016', 'Same'
)
# Create locMapper
locMapper_ss01 <- locMapperTibble_ss01 %>% pull(locType)
names(locMapper_ss01) <- locMapperTibble_ss01 %>% pull(source)
locMapper_ss01
## katl_2016 kbos_2016
## "Same" "Same"
## kdca_2016 kden_2016
## "DC-PHL-EWR (2016)" "Same"
## kdfw_2016 kdtw_2016
## "Same" "Detroit-Grand Rapids (2016)"
## kewr_2016 kgrb_2016
## "DC-PHL-EWR (2016)" "Same"
## kgrr_2016 kiah_2016
## "Detroit-Grand Rapids (2016)" "Houston-New Orleans (2016)"
## kind_2016 klas_2016
## "Same" "Las Vegas-Phoenix (2016)"
## klax_2016 klnk_2016
## "Los Angeles-San Diego (2016)" "Same"
## kmia_2016 kmke_2016
## "Miami-Tampa Bay (2016)" "Chicago-Milwaukee (2016)"
## kmsn_2016 kmsp_2016
## "Same" "Same"
## kmsy_2016 kord_2016
## "Houston-New Orleans (2016)" "Chicago-Milwaukee (2016)"
## kphl_2016 kphx_2016
## "DC-PHL-EWR (2016)" "Las Vegas-Phoenix (2016)"
## ksan_2016 ksat_2016
## "Los Angeles-San Diego (2016)" "Same"
## ksea_2016 ksfo_2016
## "Same" "Bay Area (2016)"
## ksjc_2016 kstl_2016
## "Bay Area (2016)" "Same"
## ktpa_2016 ktvc_2016
## "Miami-Tampa Bay (2016)" "Same"
# Create the data file with locType
locData_ss01 <- metarData %>%
mutate(locType=ifelse(locMapper_ss01[source]=="Same", locale, locMapper_ss01[source]),
hr=lubridate::hour(dtime)) %>%
filter(year==2016, locType !="Exclude")
# Counts by locType
locData_ss01 %>%
count(locType) %>%
arrange(-n) %>%
as.data.frame()
## locType n
## 1 DC-PHL-EWR (2016) 26252
## 2 Las Vegas-Phoenix (2016) 17543
## 3 Miami-Tampa Bay (2016) 17539
## 4 Los Angeles-San Diego (2016) 17536
## 5 Houston-New Orleans (2016) 17535
## 6 Detroit-Grand Rapids (2016) 17534
## 7 Chicago-Milwaukee (2016) 17527
## 8 Bay Area (2016) 17526
## 9 Atlanta, GA (2016) 8772
## 10 Dallas, TX (2016) 8772
## 11 Denver, CO (2016) 8772
## 12 Minneapolis, MN (2016) 8769
## 13 Traverse City, MI (2016) 8766
## 14 Lincoln, NE (2016) 8765
## 15 San Antonio, TX (2016) 8765
## 16 Seattle, WA (2016) 8765
## 17 Green Bay, WI (2016) 8755
## 18 Madison, WI (2016) 8750
## 19 Indianapolis, IN (2016) 8720
## 20 Saint Louis, MO (2016) 8702
## 21 Boston, MA (2016) 8663
The data are then sampled such that each locType is equally prevalent:
# Set a seed for reproducibility
set.seed(2007041419)
# Find the smallest locale type
nSmall <- locData_ss01 %>%
filter(!is.na(locType)) %>%
count(locType) %>%
pull(n) %>%
min()
# Create the relevant data subset
ss01_data <- locData_ss01 %>%
filter(!is.na(locType)) %>%
group_by(locType) %>%
sample_n(size=nSmall, replace=FALSE) %>%
ungroup()
# Summarize the data subset
ss01_data %>%
count(locale) %>%
arrange(-n) %>%
as.data.frame()
## locale n
## 1 Atlanta, GA (2016) 8663
## 2 Boston, MA (2016) 8663
## 3 Dallas, TX (2016) 8663
## 4 Denver, CO (2016) 8663
## 5 Green Bay, WI (2016) 8663
## 6 Indianapolis, IN (2016) 8663
## 7 Lincoln, NE (2016) 8663
## 8 Madison, WI (2016) 8663
## 9 Minneapolis, MN (2016) 8663
## 10 Saint Louis, MO (2016) 8663
## 11 San Antonio, TX (2016) 8663
## 12 Seattle, WA (2016) 8663
## 13 Traverse City, MI (2016) 8663
## 14 San Francisco, CA (2016) 4400
## 15 Las Vegas, NV (2016) 4376
## 16 Milwaukee, WI (2016) 4362
## 17 Miami, FL (2016) 4360
## 18 New Orleans, LA (2016) 4357
## 19 San Diego, CA (2016) 4345
## 20 Detroit, MI (2016) 4334
## 21 Grand Rapids, MI (2016) 4329
## 22 Los Angeles, CA (2016) 4318
## 23 Houston, TX (2016) 4306
## 24 Tampa Bay, FL (2016) 4303
## 25 Chicago, IL (2016) 4301
## 26 Phoenix, AZ (2016) 4287
## 27 San Jose, CA (2016) 4263
## 28 Philadelphia, PA (2016) 2909
## 29 Washington, DC (2016) 2896
## 30 Newark, NJ (2016) 2858
A random forest can then be run to predict locType (using a slightly larger forest of 50 trees):
# Run random forest for subset data
rf_s01_nt50_TDmcwhsa <- rfMultiLocale(ss01_data,
vrbls=c("TempF", "DewF",
"month", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir",
"modSLP", "Altimeter"
),
locs=NULL,
locVar="locType",
pred="locType",
otherVar=c("dtime", "locale", "source"),
ntree=50,
seed=2007041423,
mtry=4
)
##
## Running for locations:
## [1] "Atlanta, GA (2016)" "Bay Area (2016)"
## [3] "Boston, MA (2016)" "Chicago-Milwaukee (2016)"
## [5] "Dallas, TX (2016)" "DC-PHL-EWR (2016)"
## [7] "Denver, CO (2016)" "Detroit-Grand Rapids (2016)"
## [9] "Green Bay, WI (2016)" "Houston-New Orleans (2016)"
## [11] "Indianapolis, IN (2016)" "Las Vegas-Phoenix (2016)"
## [13] "Lincoln, NE (2016)" "Los Angeles-San Diego (2016)"
## [15] "Madison, WI (2016)" "Miami-Tampa Bay (2016)"
## [17] "Minneapolis, MN (2016)" "Saint Louis, MO (2016)"
## [19] "San Antonio, TX (2016)" "Seattle, WA (2016)"
## [21] "Traverse City, MI (2016)"
Accuracy can then be assessed:
# Evaluate prediction accuracy
evalPredictions(rf_s01_nt50_TDmcwhsa,
plotCaption = "Temp, Dew Point, Month, Hour of Day, Cloud Height, Wind, Altimeter, and SLP",
keyVar="locType",
locOrder=NULL
)
## # A tibble: 432 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 Atlanta, GA (2016) Atlanta, GA (2016) TRUE 1840 0.718
## 2 Atlanta, GA (2016) Bay Area (2016) FALSE 60 0.0234
## 3 Atlanta, GA (2016) Boston, MA (2016) FALSE 21 0.00819
## 4 Atlanta, GA (2016) Chicago-Milwaukee (2016) FALSE 17 0.00663
## 5 Atlanta, GA (2016) Dallas, TX (2016) FALSE 83 0.0324
## 6 Atlanta, GA (2016) DC-PHL-EWR (2016) FALSE 60 0.0234
## 7 Atlanta, GA (2016) Denver, CO (2016) FALSE 8 0.00312
## 8 Atlanta, GA (2016) Detroit-Grand Rapids (2016) FALSE 28 0.0109
## 9 Atlanta, GA (2016) Green Bay, WI (2016) FALSE 8 0.00312
## 10 Atlanta, GA (2016) Houston-New Orleans (2016) FALSE 52 0.0203
## # ... with 422 more rows
# Evaluate variable importance
helperPlotVarImp(rf_s01_nt50_TDmcwhsa$rfModel)
# Evaluate error evolution
errorEvolution(rf_s01_nt50_TDmcwhsa, useCategory="OOB", subT="Subset 01: 21 Groups of 1-3 Locales")
## # A tibble: 50 x 3
## ntree Category Error
## <int> <chr> <dbl>
## 1 1 OOB 0.581
## 2 2 OOB 0.583
## 3 3 OOB 0.579
## 4 4 OOB 0.569
## 5 5 OOB 0.561
## 6 6 OOB 0.552
## 7 7 OOB 0.541
## 8 8 OOB 0.529
## 9 9 OOB 0.517
## 10 10 OOB 0.506
## # ... with 40 more rows
A few other consolidations appear merited:
Example code includes:
# Create the locale mapper
locMapperTibble_ss02 <- tibble::tribble(
~source, ~locType,
'katl_2016', 'Same',
'kbos_2016', 'DC-PHL-EWR-BOS (2016)',
'kdca_2016', 'DC-PHL-EWR-BOS (2016)',
'kden_2016', 'Same',
'kdfw_2016', 'Dallas-San Antonio (2016)',
'kdtw_2016', 'Cold Midwest (2016)',
'kewr_2016', 'DC-PHL-EWR-BOS (2016)',
'kgrb_2016', 'Cold Midwest (2016)',
'kgrr_2016', 'Cold Midwest (2016)',
'kiah_2016', 'Gulf Coast (2016)',
'kind_2016', 'St Louis-Indianapolis (2016)',
'klas_2016', 'Las Vegas-Phoenix (2016)',
'klax_2016', 'Los Angeles-San Diego (2016)',
'klnk_2016', 'Same',
'kmia_2016', 'Gulf Coast (2016)',
'kmke_2016', 'Cold Midwest (2016)',
'kmsn_2016', 'Cold Midwest (2016)',
'kmsp_2016', 'Same',
'kmsy_2016', 'Gulf Coast (2016)',
'kord_2016', 'Cold Midwest (2016)',
'kphl_2016', 'DC-PHL-EWR-BOS (2016)',
'kphx_2016', 'Las Vegas-Phoenix (2016)',
'ksan_2016', 'Los Angeles-San Diego (2016)',
'ksat_2016', 'Dallas-San Antonio (2016)',
'ksea_2016', 'Same',
'ksfo_2016', 'Bay Area (2016)',
'ksjc_2016', 'Bay Area (2016)',
'kstl_2016', 'St Louis-Indianapolis (2016)',
'ktpa_2016', 'Gulf Coast (2016)',
'ktvc_2016', 'Same'
)
# Create locMapper
locMapper_ss02 <- locMapperTibble_ss02 %>% pull(locType)
names(locMapper_ss02) <- locMapperTibble_ss02 %>% pull(source)
locMapper_ss02
## katl_2016 kbos_2016
## "Same" "DC-PHL-EWR-BOS (2016)"
## kdca_2016 kden_2016
## "DC-PHL-EWR-BOS (2016)" "Same"
## kdfw_2016 kdtw_2016
## "Dallas-San Antonio (2016)" "Cold Midwest (2016)"
## kewr_2016 kgrb_2016
## "DC-PHL-EWR-BOS (2016)" "Cold Midwest (2016)"
## kgrr_2016 kiah_2016
## "Cold Midwest (2016)" "Gulf Coast (2016)"
## kind_2016 klas_2016
## "St Louis-Indianapolis (2016)" "Las Vegas-Phoenix (2016)"
## klax_2016 klnk_2016
## "Los Angeles-San Diego (2016)" "Same"
## kmia_2016 kmke_2016
## "Gulf Coast (2016)" "Cold Midwest (2016)"
## kmsn_2016 kmsp_2016
## "Cold Midwest (2016)" "Same"
## kmsy_2016 kord_2016
## "Gulf Coast (2016)" "Cold Midwest (2016)"
## kphl_2016 kphx_2016
## "DC-PHL-EWR-BOS (2016)" "Las Vegas-Phoenix (2016)"
## ksan_2016 ksat_2016
## "Los Angeles-San Diego (2016)" "Dallas-San Antonio (2016)"
## ksea_2016 ksfo_2016
## "Same" "Bay Area (2016)"
## ksjc_2016 kstl_2016
## "Bay Area (2016)" "St Louis-Indianapolis (2016)"
## ktpa_2016 ktvc_2016
## "Gulf Coast (2016)" "Same"
# Create the data file with locType
locData_ss02 <- metarData %>%
mutate(locType=ifelse(locMapper_ss02[source]=="Same", locale, locMapper_ss02[source]),
hr=lubridate::hour(dtime)) %>%
filter(year==2016, locType !="Exclude")
# Counts by locType
locData_ss02 %>%
count(locType) %>%
arrange(-n) %>%
as.data.frame()
## locType n
## 1 Cold Midwest (2016) 52566
## 2 Gulf Coast (2016) 35074
## 3 DC-PHL-EWR-BOS (2016) 34915
## 4 Las Vegas-Phoenix (2016) 17543
## 5 Dallas-San Antonio (2016) 17537
## 6 Los Angeles-San Diego (2016) 17536
## 7 Bay Area (2016) 17526
## 8 St Louis-Indianapolis (2016) 17422
## 9 Atlanta, GA (2016) 8772
## 10 Denver, CO (2016) 8772
## 11 Minneapolis, MN (2016) 8769
## 12 Traverse City, MI (2016) 8766
## 13 Lincoln, NE (2016) 8765
## 14 Seattle, WA (2016) 8765
The data are then sampled such that each locType is equally prevalent:
# Set a seed for reproducibility
set.seed(2007041447)
# Find the smallest locale type
nSmall <- locData_ss02 %>%
filter(!is.na(locType)) %>%
count(locType) %>%
pull(n) %>%
min()
# Create the relevant data subset
ss02_data <- locData_ss02 %>%
filter(!is.na(locType)) %>%
group_by(locType) %>%
sample_n(size=nSmall, replace=FALSE) %>%
ungroup()
# Summarize the data subset
ss02_data %>%
count(locale) %>%
arrange(-n) %>%
as.data.frame()
## locale n
## 1 Atlanta, GA (2016) 8765
## 2 Denver, CO (2016) 8765
## 3 Lincoln, NE (2016) 8765
## 4 Minneapolis, MN (2016) 8765
## 5 Seattle, WA (2016) 8765
## 6 Traverse City, MI (2016) 8765
## 7 Indianapolis, IN (2016) 4477
## 8 San Diego, CA (2016) 4411
## 9 San Francisco, CA (2016) 4399
## 10 Dallas, TX (2016) 4393
## 11 Las Vegas, NV (2016) 4391
## 12 Phoenix, AZ (2016) 4374
## 13 San Antonio, TX (2016) 4372
## 14 San Jose, CA (2016) 4366
## 15 Los Angeles, CA (2016) 4354
## 16 Saint Louis, MO (2016) 4288
## 17 Washington, DC (2016) 2204
## 18 New Orleans, LA (2016) 2197
## 19 Houston, TX (2016) 2196
## 20 Philadelphia, PA (2016) 2196
## 21 Tampa Bay, FL (2016) 2187
## 22 Boston, MA (2016) 2186
## 23 Miami, FL (2016) 2185
## 24 Newark, NJ (2016) 2179
## 25 Chicago, IL (2016) 1525
## 26 Madison, WI (2016) 1468
## 27 Green Bay, WI (2016) 1453
## 28 Grand Rapids, MI (2016) 1445
## 29 Milwaukee, WI (2016) 1442
## 30 Detroit, MI (2016) 1432
A random forest can then be run to predict locType (using a forest of 50 trees):
# Run random forest for subset data
rf_s02_nt50_TDmcwhsa <- rfMultiLocale(ss02_data,
vrbls=c("TempF", "DewF",
"month", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir",
"modSLP", "Altimeter"
),
locs=NULL,
locVar="locType",
pred="locType",
otherVar=c("dtime", "locale", "source"),
ntree=50,
seed=2007041449,
mtry=4
)
##
## Running for locations:
## [1] "Atlanta, GA (2016)" "Bay Area (2016)"
## [3] "Cold Midwest (2016)" "Dallas-San Antonio (2016)"
## [5] "DC-PHL-EWR-BOS (2016)" "Denver, CO (2016)"
## [7] "Gulf Coast (2016)" "Las Vegas-Phoenix (2016)"
## [9] "Lincoln, NE (2016)" "Los Angeles-San Diego (2016)"
## [11] "Minneapolis, MN (2016)" "Seattle, WA (2016)"
## [13] "St Louis-Indianapolis (2016)" "Traverse City, MI (2016)"
Accuracy can then be assessed:
# Evaluate prediction accuracy
evalPredictions(rf_s02_nt50_TDmcwhsa,
plotCaption = "Temp, Dew Point, Month, Hour of Day, Cloud Height, Wind, Altimeter, and SLP",
keyVar="locType",
locOrder=NULL
)
## # A tibble: 193 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 Atlanta, GA (2016) Atlanta, GA (2016) TRUE 2026 0.769
## 2 Atlanta, GA (2016) Bay Area (2016) FALSE 94 0.0357
## 3 Atlanta, GA (2016) Cold Midwest (2016) FALSE 13 0.00493
## 4 Atlanta, GA (2016) Dallas-San Antonio (2016) FALSE 117 0.0444
## 5 Atlanta, GA (2016) DC-PHL-EWR-BOS (2016) FALSE 48 0.0182
## 6 Atlanta, GA (2016) Denver, CO (2016) FALSE 12 0.00455
## 7 Atlanta, GA (2016) Gulf Coast (2016) FALSE 86 0.0326
## 8 Atlanta, GA (2016) Las Vegas-Phoenix (2016) FALSE 33 0.0125
## 9 Atlanta, GA (2016) Lincoln, NE (2016) FALSE 16 0.00607
## 10 Atlanta, GA (2016) Los Angeles-San Diego (2016) FALSE 54 0.0205
## # ... with 183 more rows
# Evaluate variable importance
helperPlotVarImp(rf_s02_nt50_TDmcwhsa$rfModel)
# Evaluate error evolution
errorEvolution(rf_s02_nt50_TDmcwhsa, useCategory="OOB", subT="Subset 02: 14 Groups of 1-6 Locales")
## # A tibble: 50 x 3
## ntree Category Error
## <int> <chr> <dbl>
## 1 1 OOB 0.519
## 2 2 OOB 0.519
## 3 3 OOB 0.513
## 4 4 OOB 0.505
## 5 5 OOB 0.492
## 6 6 OOB 0.479
## 7 7 OOB 0.468
## 8 8 OOB 0.457
## 9 9 OOB 0.447
## 10 10 OOB 0.435
## # ... with 40 more rows
Overall accuracy continues to improve. Further exploration may allow for continued evolution of the archetypes, particularly as related to the colder cities where accuracy is not improving with consolidation to archetypes.
Data are limited to only locales that do not meaningfully overlap with ‘Cold Midwest’. Functions are written to create the mapper, associated dataset, and associated data sample:
# Create the locale mapper
locMapperTibble_ss03 <- tibble::tribble(
~source, ~locType,
'katl_2016', 'Same',
'kbos_2016', 'Exclude',
'kdca_2016', 'Exclude',
'kden_2016', 'Same',
'kdfw_2016', 'Dallas-San Antonio (2016)',
'kdtw_2016', 'Exclude',
'kewr_2016', 'Exclude',
'kgrb_2016', 'Exclude',
'kgrr_2016', 'Exclude',
'kiah_2016', 'Gulf Coast (2016)',
'kind_2016', 'Exclude',
'klas_2016', 'Las Vegas-Phoenix (2016)',
'klax_2016', 'Los Angeles-San Diego (2016)',
'klnk_2016', 'Exclude',
'kmia_2016', 'Gulf Coast (2016)',
'kmke_2016', 'Exclude',
'kmsn_2016', 'Exclude',
'kmsp_2016', 'Exclude',
'kmsy_2016', 'Gulf Coast (2016)',
'kord_2016', 'Exclude',
'kphl_2016', 'Exclude',
'kphx_2016', 'Las Vegas-Phoenix (2016)',
'ksan_2016', 'Los Angeles-San Diego (2016)',
'ksat_2016', 'Dallas-San Antonio (2016)',
'ksea_2016', 'Same',
'ksfo_2016', 'Bay Area (2016)',
'ksjc_2016', 'Bay Area (2016)',
'kstl_2016', 'Exclude',
'ktpa_2016', 'Gulf Coast (2016)',
'ktvc_2016', 'Exclude'
)
# Create the locale mapper
locMapper_ss03 <- createLocMapper(locMapperTibble_ss03)
# Create the data file with locType
locData_ss03 <- applyLocMapper(metarData, mapper=locMapper_ss03, yearsUse=c(2016))
## locType n
## 1 Gulf Coast (2016) 35074
## 2 Las Vegas-Phoenix (2016) 17543
## 3 Dallas-San Antonio (2016) 17537
## 4 Los Angeles-San Diego (2016) 17536
## 5 Bay Area (2016) 17526
## 6 Atlanta, GA (2016) 8772
## 7 Denver, CO (2016) 8772
## 8 Seattle, WA (2016) 8765
# Create the smaller data file that is balanced by locType
ss03_data <- createBalancedSample(locData_ss03, seed=2007051311)
## locale n
## 1 Atlanta, GA (2016) 8765
## 2 Denver, CO (2016) 8765
## 3 Seattle, WA (2016) 8765
## 4 Las Vegas, NV (2016) 4406
## 5 San Diego, CA (2016) 4399
## 6 San Francisco, CA (2016) 4398
## 7 San Antonio, TX (2016) 4384
## 8 Dallas, TX (2016) 4381
## 9 San Jose, CA (2016) 4367
## 10 Los Angeles, CA (2016) 4366
## 11 Phoenix, AZ (2016) 4359
## 12 New Orleans, LA (2016) 2219
## 13 Tampa Bay, FL (2016) 2188
## 14 Houston, TX (2016) 2179
## 15 Miami, FL (2016) 2179
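The helpers used here are defined in WeatherModelingFunctions_v001.R. A hypothetical sketch of createBalancedSample(), mirroring the manual sampling steps performed earlier in this module (the argument names and printed summary are assumptions):

```r
# Hypothetical sketch of createBalancedSample(), mirroring the earlier manual steps
createBalancedSample <- function(df, seed) {
  set.seed(seed)
  # Find the size of the smallest locale type
  nSmall <- df %>%
    filter(!is.na(locType)) %>%
    count(locType) %>%
    pull(n) %>%
    min()
  # Sample each locType down to the smallest group so all are equally prevalent
  out <- df %>%
    filter(!is.na(locType)) %>%
    group_by(locType) %>%
    sample_n(size=nSmall, replace=FALSE) %>%
    ungroup()
  # Report the resulting counts by locale, as in the output above
  out %>% count(locale) %>% arrange(-n) %>% as.data.frame() %>% print()
  out
}
```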
A random forest can then be run on both the individual locales (locData_ss03) and the balanced sample of locale types (ss03_data):
# Run random forest for subset data
rf_s03_nt100_TDmcwhsa <- rfMultiLocale(ss03_data,
vrbls=c("TempF", "DewF",
"month", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir",
"modSLP", "Altimeter"
),
locs=NULL,
locVar="locType",
pred="locType",
otherVar=c("dtime", "locale", "source"),
ntree=100,
seed=2007051315,
mtry=4
)
##
## Running for locations:
## [1] "Atlanta, GA (2016)" "Bay Area (2016)"
## [3] "Dallas-San Antonio (2016)" "Denver, CO (2016)"
## [5] "Gulf Coast (2016)" "Las Vegas-Phoenix (2016)"
## [7] "Los Angeles-San Diego (2016)" "Seattle, WA (2016)"
Accuracy can then be assessed:
# Evaluate prediction accuracy
evalPredictions(rf_s03_nt100_TDmcwhsa,
plotCaption = "Temp, Dew Point, Month, Hour of Day, Cloud Height, Wind, Altimeter, and SLP",
keyVar="locType",
locOrder=NULL
)
## Warning: Removed 1 rows containing missing values (geom_text).
## # A tibble: 63 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 Atlanta, GA (2016) Atlanta, GA (2016) TRUE 2156 0.834
## 2 Atlanta, GA (2016) Bay Area (2016) FALSE 72 0.0279
## 3 Atlanta, GA (2016) Dallas-San Antonio (2016) FALSE 144 0.0557
## 4 Atlanta, GA (2016) Denver, CO (2016) FALSE 22 0.00851
## 5 Atlanta, GA (2016) Gulf Coast (2016) FALSE 79 0.0306
## 6 Atlanta, GA (2016) Las Vegas-Phoenix (2016) FALSE 36 0.0139
## 7 Atlanta, GA (2016) Los Angeles-San Diego (2016) FALSE 39 0.0151
## 8 Atlanta, GA (2016) Seattle, WA (2016) FALSE 37 0.0143
## 9 Bay Area (2016) Atlanta, GA (2016) FALSE 33 0.0124
## 10 Bay Area (2016) Bay Area (2016) TRUE 2228 0.837
## # ... with 53 more rows
# Evaluate variable importance
helperPlotVarImp(rf_s03_nt100_TDmcwhsa$rfModel)
# Evaluate error evolution
errorEvolution(rf_s03_nt100_TDmcwhsa,
useCategory="OOB",
subT="Exclude Locales with High Overlap to Cold Midwest"
)
## # A tibble: 100 x 3
## ntree Category Error
## <int> <chr> <dbl>
## 1 1 OOB 0.327
## 2 2 OOB 0.331
## 3 3 OOB 0.322
## 4 4 OOB 0.312
## 5 5 OOB 0.301
## 6 6 OOB 0.290
## 7 7 OOB 0.281
## 8 8 OOB 0.268
## 9 9 OOB 0.256
## 10 10 OOB 0.250
## # ... with 90 more rows
Predictions continue to be highly accurate for Denver, Seattle, and Las Vegas-Phoenix. There is some mixing among the Pacific coastal cities - Bay Area is sometimes predicted as Seattle or LA/San Diego, and LA/San Diego is sometimes predicted as Bay Area. There is also meaningful overlap among Dallas-San Antonio, Gulf Coast, and Atlanta.
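The Pacific-coast mixing can be inspected directly by filtering the prediction tibble to those groups. A sketch, reusing the evalPredictions() output shown above:

```r
# Inspect the Pacific-coast mixing directly (sketch; reuses the evalPredictions tibble)
pacific <- c("Bay Area (2016)", "Los Angeles-San Diego (2016)", "Seattle, WA (2016)")
evalPredictions(rf_s03_nt100_TDmcwhsa,
                plotCaption="Pacific coast confusion check",
                keyVar="locType",
                locOrder=NULL
                ) %>%
  filter(locale %in% pacific, predicted %in% pacific, !correct) %>%
  arrange(-pct)
```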
The process can then be repeated for the individual cities:
# Run random forest for subset data
rf_l03_nt100_TDmcwhsa <- rfMultiLocale(locData_ss03,
vrbls=c("TempF", "DewF",
"month", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir",
"modSLP", "Altimeter"
),
locs=NULL,
locVar="locale",
pred="locale",
otherVar=c("dtime", "locale", "source"),
ntree=100,
seed=2007051325,
mtry=4
)
##
## Running for locations:
## [1] "Atlanta, GA (2016)" "Dallas, TX (2016)"
## [3] "Denver, CO (2016)" "Houston, TX (2016)"
## [5] "Las Vegas, NV (2016)" "Los Angeles, CA (2016)"
## [7] "Miami, FL (2016)" "New Orleans, LA (2016)"
## [9] "Phoenix, AZ (2016)" "San Antonio, TX (2016)"
## [11] "San Diego, CA (2016)" "San Francisco, CA (2016)"
## [13] "San Jose, CA (2016)" "Seattle, WA (2016)"
## [15] "Tampa Bay, FL (2016)"
Accuracy can then be assessed:
# Evaluate prediction accuracy
evalPredictions(rf_l03_nt100_TDmcwhsa,
plotCaption = "Temp, Dew Point, Month, Hour of Day, Cloud Height, Wind, Altimeter, and SLP",
keyVar="locale",
locOrder=NULL
)
## Warning: Removed 1 rows containing missing values (geom_text).
## # A tibble: 216 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 Atlanta, GA (2016) Atlanta, GA (2016) TRUE 2064 0.775
## 2 Atlanta, GA (2016) Dallas, TX (2016) FALSE 125 0.0469
## 3 Atlanta, GA (2016) Denver, CO (2016) FALSE 23 0.00864
## 4 Atlanta, GA (2016) Houston, TX (2016) FALSE 41 0.0154
## 5 Atlanta, GA (2016) Las Vegas, NV (2016) FALSE 27 0.0101
## 6 Atlanta, GA (2016) Los Angeles, CA (2016) FALSE 49 0.0184
## 7 Atlanta, GA (2016) Miami, FL (2016) FALSE 18 0.00676
## 8 Atlanta, GA (2016) New Orleans, LA (2016) FALSE 34 0.0128
## 9 Atlanta, GA (2016) Phoenix, AZ (2016) FALSE 24 0.00901
## 10 Atlanta, GA (2016) San Antonio, TX (2016) FALSE 54 0.0203
## # ... with 206 more rows
# Evaluate variable importance
helperPlotVarImp(rf_l03_nt100_TDmcwhsa$rfModel)
# Evaluate error evolution
errorEvolution(rf_l03_nt100_TDmcwhsa,
useCategory="OOB",
subT="Exclude Locales with High Overlap to Cold Midwest"
)
## # A tibble: 100 x 3
## ntree Category Error
## <int> <chr> <dbl>
## 1 1 OOB 0.442
## 2 2 OOB 0.440
## 3 3 OOB 0.431
## 4 4 OOB 0.428
## 5 5 OOB 0.418
## 6 6 OOB 0.406
## 7 7 OOB 0.394
## 8 8 OOB 0.380
## 9 9 OOB 0.368
## 10 10 OOB 0.358
## # ... with 90 more rows
At a glance, the group consolidations seem sensible given how the individual cities tend to be (mis)classified.
Random forests can also be used for regression on continuous variables such as temperature or dew point. Suppose a few data elements are available and the goal is to predict the temperature from them:
# Build a regression dataset with factored variables for locale and source
regrData <- metarData %>%
mutate(hr=lubridate::hour(lubridate::round_date(dtime, unit="1 hour")),
hrfct=factor(hr),
localefct=factor(locale),
locName=str_replace(locale, pattern=" \\(\\d{4}\\)", replacement=""),
locNamefct=factor(locName)
)
# Define a dependent variable and predictor variables and filtering criteria
depVar <- "TempF"
predVars <- c("DewF", "month", "hrfct", "locNamefct")
critFilter <- list("year"=c(2016),
"locNamefct"=c("Chicago, IL", "Las Vegas, NV", "San Diego, CA", "New Orleans, LA")
)
# Filter such that only non-NA data across all of depVar and predVars are included
subRegData <- regrData %>%
filter_at(vars(all_of(c(depVar, predVars, names(critFilter)))), all_vars(!is.na(.)))
# Filter such that only matches to critFilter are included
for (xNum in seq_len(length(critFilter))) {
subRegData <- subRegData %>%
filter_at(vars(all_of(names(critFilter)[xNum])), ~. %in% critFilter[[xNum]])
}
# Split into a training set and a test set
testSize <- 0.3
set.seed(2007151232)
idxTrain <- sample(1:nrow(subRegData), size=round((1-testSize)*nrow(subRegData)), replace=FALSE)
trainRegData <- subRegData[idxTrain, ] # will be in random order
testRegData <- subRegData[-idxTrain, ] # will be in sorted order
# Run random forest regression
rfRegrData <- randomForest::randomForest(x=trainRegData[, predVars, drop=FALSE],
y=trainRegData[, depVar, drop=TRUE],
ntree=25,
mtry=2
)
# Assess variable importance
rfRegrData %>%
caret::varImp()
## Overall
## DewF 1940150.5
## month 2886275.7
## hrfct 543114.8
## locNamefct 1848806.0
# Assess evolution of OOB r-squared and RMSE
tibble::tibble(ntree=1:25, rmse=rfRegrData$mse**0.5) %>%
ggplot(aes(x=ntree, y=rmse)) +
geom_point() +
ylim(c(0, NA)) +
labs(x="# Trees", y="RMSE", title="Evolution of RMSE")
tibble::tibble(ntree=1:25, rmse=rfRegrData$rsq) %>%
ggplot(aes(x=ntree, y=rmse)) +
geom_point() +
ylim(c(NA, 1)) +
labs(x="# Trees", y="R-squared", title="Evolution of R-squared")
# Apply to test-set data and report metrics
rfPreds <- testRegData %>%
mutate(pred=predict(rfRegrData, newdata=testRegData[, predVars, drop=FALSE]),
globMean=mean(get(depVar))
) %>%
group_by(locale) %>%
mutate(locMean=mean(get(depVar))) %>%
ungroup() %>%
mutate(errActualGlob=(get(depVar)-globMean)**2,
errActualLoc=(get(depVar)-locMean)**2,
errPredict=(get(depVar)-pred)**2
)
# Calculate the errors by locale
rfError <- rfPreds %>%
select(locale, starts_with("err")) %>%
group_by(locale) %>%
summarize_all(~mean(.)**0.5)
# Plot errors by locale
rfError %>%
pivot_longer(-locale, names_to="errorType", values_to="rmse") %>%
ggplot(aes(x=locale, y=rmse, fill=errorType)) +
geom_col(position="dodge") +
labs(x="", y="Average Error", title="Prediction Accuracy by Locale") +
scale_fill_discrete("Actual vs.",
c("errActualGlob", "errActualLoc", "errPredict"),
c("Global Mean", "Locale Mean", "Predicted")
)
# Print errors by locale, including percentage changes
rfError %>%
mutate(pctLocGlob=errActualLoc/errActualGlob,
pctPredLoc=errPredict/errActualLoc,
pctPredGlob=errPredict/errActualGlob
)
## # A tibble: 4 x 7
## locale errActualGlob errActualLoc errPredict pctLocGlob pctPredLoc pctPredGlob
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Chica~ 24.5 20.7 5.47 0.844 0.264 0.223
## 2 Las V~ 19.5 18.5 5.74 0.951 0.310 0.295
## 3 New O~ 14.9 13.1 3.69 0.879 0.282 0.248
## 4 San D~ 6.82 6.82 3.41 1.00 0.499 0.499
A simple random forest significantly reduces null error in predicting temperatures, particularly for Chicago, Las Vegas, and New Orleans. This suggests that knowing the dew point and the month enhances predictions for seasonal locales, while being of less use in a marine climate with modest seasonal variation.
Note that San Diego has more accurate predictions than any other locale; it is just that San Diego does not need much more than the locale mean to be accurate. In contrast, Chicago averages ~25 degrees of error using the global mean, ~20 degrees using the locale mean, and ~5 degrees using dew point, month, and hour.
Notably, running with just a single tree already explains nearly 90% of the variance and drives average prediction errors down to under 6 degrees. There is some improvement with additional trees, but it is modest. Large forests may therefore be less necessary for this regression analysis phase.
The above process is converted to functional form, with the main function performing the filtering and then calling a modified rfTwoLocale() that has been adapted to handle either classification or regression. The function rfRegression() is available in WeatherModelingFunctions_v001.R.
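A hypothetical sketch of what rfRegression() might look like, mirroring the manual filtering, splitting, and fitting steps above (the return structure is an assumption):

```r
# Hypothetical sketch of rfRegression(), mirroring the earlier manual steps
rfRegression <- function(df, depVar, predVars, critFilter, seed, ntree, mtry, testSize) {
  # Keep only non-NA rows across the dependent, predictor, and filter variables
  sub <- df %>%
    filter_at(vars(all_of(c(depVar, predVars, names(critFilter)))), all_vars(!is.na(.)))
  # Apply each filtering criterion in turn
  for (xNum in seq_len(length(critFilter))) {
    sub <- sub %>%
      filter_at(vars(all_of(names(critFilter)[xNum])), ~. %in% critFilter[[xNum]])
  }
  # Split into a training set and a test set
  set.seed(seed)
  idxTrain <- sample(1:nrow(sub), size=round((1-testSize)*nrow(sub)), replace=FALSE)
  # Fit the random forest regression on the training rows
  rfModel <- randomForest::randomForest(x=sub[idxTrain, predVars, drop=FALSE],
                                        y=sub[idxTrain, depVar, drop=TRUE],
                                        ntree=ntree,
                                        mtry=mtry
                                        )
  list(rfModel=rfModel, train=sub[idxTrain, ], test=sub[-idxTrain, ])
}
```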
The regression is run again using the same arguments with the new function, and results are cached:
rfRegrData2 <- rfRegression(regrData,
depVar=depVar,
predVars=predVars,
critFilter=critFilter,
seed=2007151232,
ntree=25,
mtry=2,
testSize=0.3
)
The general outcomes are nearly identical:
# Assess variable importance
rfRegrData2$rfModel %>%
caret::varImp()
## Overall
## DewF 1878740.7
## month 2945222.5
## hrfct 545891.1
## locNamefct 1886122.8
# Assess evolution of OOB r-squared and RMSE
tibble::tibble(ntree=1:25, rmse=rfRegrData2$mse**0.5) %>%
ggplot(aes(x=ntree, y=rmse)) +
geom_point() +
ylim(c(0, NA)) +
labs(x="# Trees", y="RMSE", title="Evolution of RMSE")
tibble::tibble(ntree=1:25, rmse=rfRegrData2$rsq) %>%
ggplot(aes(x=ntree, y=rmse)) +
geom_point() +
ylim(c(NA, 1)) +
labs(x="# Trees", y="R-squared", title="Evolution of R-squared")
An evaluation function suitable for regression is written to cover a set of standard metrics.
Each sub-section is written as its own function, with a wrapper function that calls them all together. The functions below are available in WeatherModelingFunctions_v001.R:
Example code for running the functions alone and together includes:
# Run function #1
plotVariableImportance(rfRegrData,
returnData=TRUE,
caption="temperature ~ f(locale, month, hour, dew point)",
titleAdd=": Predicting Temperature"
)
## # A tibble: 4 x 2
## predictor Overall
## <chr> <dbl>
## 1 DewF 1940151.
## 2 month 2886276.
## 3 hrfct 543115.
## 4 locNamefct 1848806.
# Run function #2
plotMSERSQ(rfRegrData, returnData=TRUE)
## # A tibble: 25 x 3
## ntree rmse rsq
## <int> <dbl> <dbl>
## 1 1 5.78 0.891
## 2 2 5.64 0.897
## 3 3 5.51 0.901
## 4 4 5.39 0.906
## 5 5 5.29 0.909
## 6 6 5.23 0.911
## 7 7 5.19 0.913
## 8 8 5.14 0.914
## 9 9 5.07 0.916
## 10 10 5.04 0.917
## # ... with 15 more rows
# Run function #3
plotErrorByVar(rfRegrData2, depVar="TempF", keyVar="locNamefct", returnData=TRUE)
## # A tibble: 4 x 5
## locNamefct varTot varMod rmse rsq
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 Chicago, IL 427. 29.9 5.46 0.930
## 2 Las Vegas, NV 343. 32.9 5.73 0.904
## 3 New Orleans, LA 171. 13.7 3.70 0.920
## 4 San Diego, CA 46.6 11.8 3.43 0.747
# Run function #4
plotActualVsPredicted(rfRegrData2, depVar="TempF", facetVar="locNamefct", returnData=TRUE)
## # A tibble: 10,495 x 9
## actual source dtime DewF month hrfct locNamefct predicted
## <dbl> <chr> <dttm> <dbl> <fct> <fct> <fct> <dbl>
## 1 41 klas_~ 2016-01-01 03:56:00 8.06 Jan 4 Las Vegas~ 42.9
## 2 39.0 klas_~ 2016-01-01 04:56:00 8.06 Jan 5 Las Vegas~ 37.5
## 3 32 klas_~ 2016-01-01 10:56:00 8.96 Jan 11 Las Vegas~ 34.8
## 4 30.0 klas_~ 2016-01-01 11:56:00 8.96 Jan 12 Las Vegas~ 35.1
## 5 42.1 klas_~ 2016-01-01 18:56:00 8.96 Jan 19 Las Vegas~ 45.4
## 6 46.0 klas_~ 2016-01-01 20:56:00 8.96 Jan 21 Las Vegas~ 49.6
## 7 48.0 klas_~ 2016-01-01 22:56:00 10.9 Jan 23 Las Vegas~ 53.8
## 8 45.0 klas_~ 2016-01-02 01:56:00 10.9 Jan 2 Las Vegas~ 49.4
## 9 41 klas_~ 2016-01-02 02:56:00 12.9 Jan 3 Las Vegas~ 51.0
## 10 39.9 klas_~ 2016-01-02 08:56:00 15.1 Jan 9 Las Vegas~ 43.5
## # ... with 10,485 more rows, and 1 more variable: correct <lgl>
# Run combined function
rfOut_002 <- evalRFRegression(rfRegrData2,
depVar="TempF",
keyVar="locNamefct",
facetVar="locNamefct",
caption="Temp ~ f(locale, month, dew point, hour)",
returnData=TRUE
)
The random forest is then run for an expanded list of locales in 2016, with results cached:
keyLocs3 <- c("Chicago, IL", "Las Vegas, NV", "San Diego, CA", "New Orleans, LA", "Atlanta, GA", "Denver, CO", "Dallas, TX", "Miami, FL", "Phoenix, AZ", "Seattle, WA", "Boston, MA", "San Francisco, CA")
# Run for select 2016 data
rfRegrData3 <- rfRegression(regrData,
depVar="TempF",
predVars=c("DewF", "month", "hrfct", "locNamefct"),
critFilter=list(year=2016, locNamefct=keyLocs3),
seed=2007171405,
ntree=20,
mtry=2,
testSize=0.3
)
The results can then be evaluated:
rfOut_003 <- evalRFRegression(rfRegrData3,
depVar="TempF",
keyVar="locNamefct",
facetVar="locNamefct",
caption="Temp ~ f(locale, month, dew point, hour)",
returnData=TRUE
)
Month, locale, and dew point all rank relatively high in variable importance, while hour ranks low. This is somewhat surprising given the usual diurnal temperature cycle, though the results suggest that month and dew point have a greater impact on temperature. Denver is the hardest locale for the model to predict.
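Denver's difficulty can be confirmed by ranking the per-locale errors, reusing plotErrorByVar() (whose returnData=TRUE output includes an rmse column, as shown earlier):

```r
# Rank locales from hardest to easiest by model RMSE
plotErrorByVar(rfRegrData3, depVar="TempF", keyVar="locNamefct", returnData=TRUE) %>%
  arrange(-rmse)
```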
The process is run again with hour excluded to see how that impacts the results:
keyLocs3a <- c("Chicago, IL", "Las Vegas, NV", "San Diego, CA", "New Orleans, LA", "Atlanta, GA", "Denver, CO", "Dallas, TX", "Miami, FL", "Phoenix, AZ", "Seattle, WA", "Boston, MA", "San Francisco, CA")
# Run for select 2016 data
rfRegrData3a <- rfRegression(regrData,
depVar="TempF",
predVars=c("DewF", "month", "locNamefct"),
otherVar=c("source", "dtime", "hrfct"),
critFilter=list(year=2016, locNamefct=keyLocs3a),
seed=2007171405,
ntree=20,
mtry=2,
testSize=0.3
)
And, the results are again evaluated:
rfOut_003a <- evalRFRegression(rfRegrData3a,
depVar="TempF",
keyVar="locNamefct",
facetVar="locNamefct",
caption="Temp ~ f(locale, month, dew point)",
returnData=TRUE
)
Run time is significantly faster with the elimination of hour, a 24-level factor. Interestingly, there is no longer much change in RMSE or R-squared with more trees, suggesting that the main work done in the previous model was incorporating the impact of hour. Removing hour meaningfully degrades both RMSE (~1.5 degrees worse) and R-squared (~5% less variance explained). Several locales no longer have prediction smooths that follow the 45-degree line, suggesting that hour may be especially helpful in specific locales. Next steps are to explore this further.
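A direct comparison of the two error trajectories could be sketched by stacking the plotMSERSQ() tibbles for the two models (assuming both objects expose the same structure):

```r
# Compare OOB RMSE evolution with and without hour as a predictor
dplyr::bind_rows(
  plotMSERSQ(rfRegrData3, returnData=TRUE) %>% mutate(model="With hour"),
  plotMSERSQ(rfRegrData3a, returnData=TRUE) %>% mutate(model="Without hour")
) %>%
  ggplot(aes(x=ntree, y=rmse, color=model)) +
  geom_line() +
  ylim(c(0, NA)) +
  labs(x="# Trees", y="RMSE", title="Impact of Excluding Hour on RMSE")
```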